An introduction fo scMerge2

Atlas-scale single-cell multi-sample multi-condition data integration using scMerge2

Authors: Yingxin Lin².
Last modified: 2 August, 2023.

Overview

The recent emergence of multi-sample multi-condition single-cell multi-cohort studies allows researchers to investigate different cell states. The effective integration of multiple large-cohort studies promises biological insights into cells under different conditions that individual studies cannot provide.

Description

With the rapid emergence of multi-sample multi-condition single-cell studies and the increased number of datasets for integration, our proposed scMerge2 addresses challenges associated with scalability of cells and studies as well as producing analytically ready data (i.e. adjusted expression matrix). This is achieved via three key innovations compared to the previous version of scMerge:

Hierarchical integration is used to capture both local and global variation. scMerge2 provides users with a more flexible and adaptable multi-level merging structure, of which each level can comprise multiple collections of several batches and batch correction can be performed within each collection separately using user-defined batch labels.
Pseudo-bulk construction is used to reduce computing load, allowing for the analysis of datasets containing millions of cells.
Pseudo-replication inside each condition is built, allowing for the modelling of numerous conditions.

In essence, scMerge2 takes gene expression matrices from a collection of datasets and integrates them in a hierarchical manner. The final output of scMerge2 is a single adjusted expression matrix with all input data matrices merged and ready for downstream analysis.

Pre-requisites

It is expected that students will have:

Basic knowledge of R syntax
Familiarity with SingleCellExperiment objects

Participation

While it will be possible for participants to run code as we go through the demonstration, given time constraints, I would encourage them to focus their attention into integration strategies behind scMerge2 (pseudo-replicates, pseudo-bulk, stably expressed genes, number of unwatned variation factors and hierarchical mering etc.). Questions are welcome both within the workshop and if participants choose to work through the workshop independently after the demonstration.

R / Bioconductor packages used

This workshop will focus on Bioconductor packages scMerge and [SingleCellExperiment] (https://bioconductor.org/packages/release/bioc/html/SingleCellExperiment.html).

Time outline

An example for a 45-minute workshop:

Activity	Time
Introduction of method	10m
Introduction of core function	10m
Hierarchical merging	10m
Best practise	10m

Workshop

Loading Packages and Data

We will load the scMerge package. We designed our package to be consistent with the popular BioConductor’s single cell analysis framework, namely the SingleCellExperiment and scater package.

suppressPackageStartupMessages({
    library(SingleCellExperiment)
    library(scMerge)
    library(scater)
    library(scMergeBioc2023)
    library(ggpubr)
})

We provided an illustrative mouse embryonic stem cell (mESC) data in our package, as well as a set of pre-computed stably expressed gene (SEG) list to be used as negative control genes.

The full curated, unnormalised mESC data can be found here. The scMerge package comes with a sub-sampled, two-batches version of this data (named “batch2” and “batch3” to be consistent with the full data) .

## Subsetted mouse ESC data
data("example_sce", package = "scMerge")
data("segList_ensemblGeneID", package = "scMerge")

In this mESC data, we pooled data from 2 different batches from three different cell types. Using a PCA plot, we can see that despite strong separation of cell types, there is also a strong separation due to batch effects. This information is stored in the colData of example_sce.

example_sce <- runPCA(example_sce, exprs_values = "logcounts")

scater::plotPCA(example_sce, 
                colour_by = "cellTypes", 
                shape_by = "batch")

scMerge2

Unsupervised `scMerge2`

In unsupervised scMerge2, we will perform graph clustering on shared nearest neighbour graphs within each batch to obtain pseudo-replicates. This requires the users to supply a k_celltype vector with the number of neighbour when constructed the nearest neighbour graph in each of the batches. By default, this number is 10.

scMerge2_res <- scMerge2(exprsMat = logcounts(example_sce),
                         batch = example_sce$batch,
                         ctl = segList_ensemblGeneID$mouse$mouse_scSEG,
                         verbose = FALSE)
#> Warning in (function (A, nv = 5, nu = nv, maxit = 1000, work = nv + 7, reorth =
#> TRUE, : You're computing too large a percentage of total singular values, use a
#> standard svd instead.
#> Warning in (function (A, nv = 5, nu = nv, maxit = 1000, work = nv + 7, reorth =
#> TRUE, : You're computing too large a percentage of total singular values, use a
#> standard svd instead.

assay(example_sce, "scMerge2") <- scMerge2_res$newY

set.seed(2022)
example_sce <- scater::runPCA(example_sce, exprs_values = 'scMerge2')                                       
scater::plotPCA(example_sce, colour_by = 'cellTypes', shape = 'batch')

Semi-supervised `scMerge2`

When cell type information are known (e.g. results from cell type classification using reference), scMerge2 can use this information to construct pseudo-replicates and identify mutual nearest groups with cellTypes input.

scMerge2_res <- scMerge2(exprsMat = logcounts(example_sce),
                         batch = example_sce$batch,
                         cellTypes = example_sce$cellTypes,
                         ctl = segList_ensemblGeneID$mouse$mouse_scSEG,
                         verbose = FALSE)


assay(example_sce, "scMerge2") <- scMerge2_res$newY

example_sce = scater::runPCA(example_sce, exprs_values = 'scMerge2')                                       
scater::plotPCA(example_sce, colour_by = 'cellTypes', shape = 'batch')

More details of scMerge2

Number of pseudobulk

The number of pseudobulk constructed within each cell grouping is set via k_pseudoBulk. By default, this number is set as 30. A larger number will create more pseudo-bulk data in model estimation, with longer time in estimation.

scMerge2_res <- scMerge2(exprsMat = logcounts(example_sce),
                         batch = example_sce$batch,
                         ctl = segList_ensemblGeneID$mouse$mouse_scSEG,
                         k_pseudoBulk = 50,
                         verbose = FALSE)
#> Warning in (function (A, nv = 5, nu = nv, maxit = 1000, work = nv + 7, reorth =
#> TRUE, : You're computing too large a percentage of total singular values, use a
#> standard svd instead.
#> Warning in (function (A, nv = 5, nu = nv, maxit = 1000, work = nv + 7, reorth =
#> TRUE, : You're computing too large a percentage of total singular values, use a
#> standard svd instead.


assay(example_sce, "scMerge2") <- scMerge2_res$newY

set.seed(2022)
example_sce <- scater::runPCA(example_sce, exprs_values = 'scMerge2')                                       
scater::plotPCA(example_sce, colour_by = 'cellTypes', shape = 'batch')

Return matrix by batch

When working with large data, we can get the adjusted matrix for a smaller subset of cells each time. This can be achieved by setting return_matrix to FALSE in scMerge2() function, which the function then will not return the adjusted whole matrix but will output the estimated fullalpha. Then to get the adjusted matrix using the estimated fullalpha, we first need to performed cosine normalisation on the logcounts matrix and then calculate the row-wise (gene-wise) mean of the cosine normalised matrix (This is because by default, scMerge2() perform cosine normalisation on the log-normalised matrix before RUVIII step). Then we can use getAdjustedMat() to adjust the matrix of a subset of cells each time.

scMerge2_res <- scMerge2(exprsMat = logcounts(example_sce),
                         batch = example_sce$batch,
                         ctl = segList_ensemblGeneID$mouse$mouse_scSEG,
                         verbose = FALSE,
                         return_matrix = FALSE)
#> Warning in (function (A, nv = 5, nu = nv, maxit = 1000, work = nv + 7, reorth =
#> TRUE, : You're computing too large a percentage of total singular values, use a
#> standard svd instead.
#> Warning in (function (A, nv = 5, nu = nv, maxit = 1000, work = nv + 7, reorth =
#> TRUE, : You're computing too large a percentage of total singular values, use a
#> standard svd instead.

cosineNorm_mat <- batchelor::cosineNorm(logcounts(example_sce))
adjusted_means <- DelayedMatrixStats::rowMeans2(cosineNorm_mat)

newY <- list()
for (i in levels(example_sce$batch)) {
    newY[[i]] <- getAdjustedMat(cosineNorm_mat[, example_sce$batch == i], 
                                scMerge2_res$fullalpha,
                                ctl = segList_ensemblGeneID$mouse$mouse_scSEG,
                                ruvK = 20,
                                adjusted_means = adjusted_means)
}
newY <- do.call(cbind, newY)

assay(example_sce, "scMerge2") <- newY[, colnames(example_sce)]

set.seed(2022)
example_sce <- scater::runPCA(example_sce, exprs_values = 'scMerge2')                                       
scater::plotPCA(example_sce, colour_by = 'cellTypes', shape = 'batch')

Note that we can also adjust only a subset of genes by input a gene list in return_subset_genes in both getAdjustedMat() and scMerge2().

Hierarchical scMerge2

scMerge2 provides a hierarchical merging strategy for data integration that requires multiple level adjustment through function scMerge2h(). For example, for the dataset with multiple samples, we may want to remove the sample effect first within the dataset before integrate it with other datasets. Below, we will illustrate how we can build a hierarchical merging order as input for scMerge2h().

For illustration purpose, here I first create a fake sample information for the sample data. Now, each batch has two samples.

# Create a fake sample information
example_sce$sample <- rep(c(1:4), each = 50)
table(example_sce$sample, example_sce$batch)
#>    
#>     batch2 batch3
#>   1     50      0
#>   2     50      0
#>   3      0     50
#>   4      0     50

To perform scMerge2h, we need to create

a hierarchical index list that indicates the indices of the cells that are going into merging;
a batch information list that indicates the batch information of each merging.

Scenario 1

We will first illustrate that a two-level merging case, where the first level refers to the sample effect removal within each batch, and the second level refers to the merging of two batches.

First, we will construct the hierarchical index list (h_idx_list). The hierarchical index list is a list that indicates for each level, which indices of the cells are going into merging. The number of the element of the list should be the same with the number of level of merging. For each element, it should contain a list of vectors of indices of each merging.

# Construct a hierarchical index list
h_idx_list <- list(level1 = split(seq_len(ncol(example_sce)), example_sce$batch),
                   level2 = list(seq_len(ncol(example_sce))))

On level 1, we will perform two merging, one for each batch. Therefore, we have a list of two vectors of indices. Each indicates the indices of the cells of the batches.

h_idx_list$level1
#> $batch2
#>   [1]   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18
#>  [19]  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36
#>  [37]  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54
#>  [55]  55  56  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71  72
#>  [73]  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89  90
#>  [91]  91  92  93  94  95  96  97  98  99 100
#> 
#> $batch3
#>   [1] 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118
#>  [19] 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136
#>  [37] 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154
#>  [55] 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172
#>  [73] 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190
#>  [91] 191 192 193 194 195 196 197 198 199 200

On level 2, we will perform one merging to merge two batches. Therefore, we have a list of one vector of indices, indicates all the indices of the cells of the full data.

h_idx_list$level2
#> [[1]]
#>   [1]   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18
#>  [19]  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36
#>  [37]  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54
#>  [55]  55  56  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71  72
#>  [73]  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89  90
#>  [91]  91  92  93  94  95  96  97  98  99 100 101 102 103 104 105 106 107 108
#> [109] 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126
#> [127] 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144
#> [145] 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162
#> [163] 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180
#> [181] 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198
#> [199] 199 200

Next, we need to create the batch information list (batch_list), which has the same structure with the h_idx_list, but indicates the batch label of the h_idx_list.

# Construct a batch information list
batch_list <- list(level1 = split(example_sce$sample, example_sce$batch),
                   level2 = list(example_sce$batch))

We can see batch_list indicates the batch label for level of the merging.

batch_list$level1
#> $batch2
#>   [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
#>  [38] 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
#>  [75] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
#> 
#> $batch3
#>   [1] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
#>  [38] 3 3 3 3 3 3 3 3 3 3 3 3 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
#>  [75] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4

Now, we can input the batch_list and h_idx_list in scMerge2h to merge the data hierarchically. We also need to input a ruvK_list, a vector of number of unwanted variation (k in RUV model) for each level of the merging. We suggest a lower ruvK in the first level. Here we set ruvK_list = c(2, 5) which indicates we will set RUV’s k equal to 2 in the first level and 5 in the second level.

scMerge2_res <- scMerge2h(exprsMat = logcounts(example_sce),
                          batch_list = batch_list,
                          h_idx_list = h_idx_list,
                          ctl = segList_ensemblGeneID$mouse$mouse_scSEG,
                          ruvK_list = c(2, 5),
                          verbose = FALSE)

The output of scMerge2h is a list of matrices indicates the adjusted matrix from each level.

length(scMerge2_res)
#> [1] 2
lapply(scMerge2_res, dim)
#> [[1]]
#> [1] 1047  200
#> 
#> [[2]]
#> [1] 1047  200

Here we will use the adjusted matrix from the last level as the final adjusted matrix.

assay(example_sce, "scMerge2") <- scMerge2_res[[length(h_idx_list)]]
set.seed(2022)
example_sce <- scater::runPCA(example_sce, exprs_values = 'scMerge2')                           
scater::plotPCA(example_sce, colour_by = 'cellTypes', shape = 'batch')

Scenario 2

scMerge2h can handle a flexible merging strategy input. For example, on level 1 above, we can only merge data for one batch. As an example, we can start from modify the batch index list and hierarchical index list to remove the list of batch 2 on level 1.

h_idx_list2 <- h_idx_list
batch_list2 <- batch_list
h_idx_list2$level1$batch2 <- NULL
batch_list2$level1$batch2 <- NULL
print(h_idx_list2)
#> $level1
#> $level1$batch3
#>   [1] 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118
#>  [19] 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136
#>  [37] 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154
#>  [55] 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172
#>  [73] 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190
#>  [91] 191 192 193 194 195 196 197 198 199 200
#> 
#> 
#> $level2
#> $level2[[1]]
#>   [1]   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18
#>  [19]  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36
#>  [37]  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54
#>  [55]  55  56  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71  72
#>  [73]  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89  90
#>  [91]  91  92  93  94  95  96  97  98  99 100 101 102 103 104 105 106 107 108
#> [109] 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126
#> [127] 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144
#> [145] 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162
#> [163] 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180
#> [181] 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198
#> [199] 199 200
print(batch_list2)
#> $level1
#> $level1$batch3
#>   [1] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
#>  [38] 3 3 3 3 3 3 3 3 3 3 3 3 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
#>  [75] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
#> 
#> 
#> $level2
#> $level2[[1]]
#>   [1] batch2 batch2 batch2 batch2 batch2 batch2 batch2 batch2 batch2 batch2
#>  [11] batch2 batch2 batch2 batch2 batch2 batch2 batch2 batch2 batch2 batch2
#>  [21] batch2 batch2 batch2 batch2 batch2 batch2 batch2 batch2 batch2 batch2
#>  [31] batch2 batch2 batch2 batch2 batch2 batch2 batch2 batch2 batch2 batch2
#>  [41] batch2 batch2 batch2 batch2 batch2 batch2 batch2 batch2 batch2 batch2
#>  [51] batch2 batch2 batch2 batch2 batch2 batch2 batch2 batch2 batch2 batch2
#>  [61] batch2 batch2 batch2 batch2 batch2 batch2 batch2 batch2 batch2 batch2
#>  [71] batch2 batch2 batch2 batch2 batch2 batch2 batch2 batch2 batch2 batch2
#>  [81] batch2 batch2 batch2 batch2 batch2 batch2 batch2 batch2 batch2 batch2
#>  [91] batch2 batch2 batch2 batch2 batch2 batch2 batch2 batch2 batch2 batch2
#> [101] batch3 batch3 batch3 batch3 batch3 batch3 batch3 batch3 batch3 batch3
#> [111] batch3 batch3 batch3 batch3 batch3 batch3 batch3 batch3 batch3 batch3
#> [121] batch3 batch3 batch3 batch3 batch3 batch3 batch3 batch3 batch3 batch3
#> [131] batch3 batch3 batch3 batch3 batch3 batch3 batch3 batch3 batch3 batch3
#> [141] batch3 batch3 batch3 batch3 batch3 batch3 batch3 batch3 batch3 batch3
#> [151] batch3 batch3 batch3 batch3 batch3 batch3 batch3 batch3 batch3 batch3
#> [161] batch3 batch3 batch3 batch3 batch3 batch3 batch3 batch3 batch3 batch3
#> [171] batch3 batch3 batch3 batch3 batch3 batch3 batch3 batch3 batch3 batch3
#> [181] batch3 batch3 batch3 batch3 batch3 batch3 batch3 batch3 batch3 batch3
#> [191] batch3 batch3 batch3 batch3 batch3 batch3 batch3 batch3 batch3 batch3
#> Levels: batch2 batch3

scMerge2_res <- scMerge2h(exprsMat = logcounts(example_sce),
                          batch_list = batch_list2,
                          h_idx_list = h_idx_list2,
                          ctl = segList_ensemblGeneID$mouse$mouse_scSEG,
                          ruvK_list = c(2, 5),
                          verbose = FALSE)

assay(example_sce, "scMerge2") <- scMerge2_res[[length(h_idx_list)]]
set.seed(2022)
example_sce <- scater::runPCA(example_sce, exprs_values = 'scMerge2')                           
scater::plotPCA(example_sce, colour_by = 'cellTypes', shape = 'batch')

Phenotype specific pseudo-replicates

Next, we will use another dataset as an example with two conditions to demonstrate multi-condition data integration. The data is a subset of CITE-seq data from a COVID-19 study (Stephenson et al. 2021). Here we subset the data to only 4 individuals from 2 sites.

data(sce_covid)
sce_covid
#> class: SingleCellExperiment 
#> dim: 12037 3000 
#> metadata(0):
#> assays(2): counts logcounts
#> rownames(12037): LINC00115 NOC2L ... AL592183.1 AC240274.1
#> rowData names(0):
#> colnames(3000): BGCV03_GCGAGAAGTTTGGCGC-1 CCGTACTGTGTCAATC-MH8919332
#>   ... CTACATTAGATCCCGC-MH8919331 BGCV01_TCAGCTCAGGCTCAGA-1
#> colData names(6): meta_sample_id level1 ... Site meta_severity
#> reducedDimNames(2): PCA_ADT UMAP_ADT
#> mainExpName: NULL
#> altExpNames(1): ADT

We can first explore the data structure of this subset data. We can see we have 4 samples from 2 sites, each site has one sample from Healthy donor and one sample from Mild/Moderate COVID-19 patients.

ggplot(data.frame(colData(sce_covid)), 
       aes(x = meta_sample_id,
           fill = level2)) +
    geom_bar() +
    facet_wrap(~meta_severity + Site, scale = "free_y", ncol = 1) +
    theme_bw() +
    coord_flip()

We will first look at the RNA-seq data.

decomp <- scran::modelGeneVar(sce_covid)
hvg <- scran::getTopHVGs(decomp, n = 2000)
sce_covid <- scater::runPCA(sce_covid, 
                      exprs_values = 'logcounts', subset_row = hvg)           
sce_covid <- scater::runUMAP(sce_covid, dimred = "PCA")    
g1 <- scater::plotUMAP(sce_covid, colour_by = 'level2', shape = 'Site')
g2 <- scater::plotUMAP(sce_covid, colour_by = 'level3', shape = 'Site')
g3 <- scater::plotUMAP(sce_covid, colour_by = 'meta_severity')
g4 <- scater::plotUMAP(sce_covid, colour_by = 'Site')
ggpubr::ggarrange(g1, g2, g3, g4, ncol = 2, nrow = 2, align = "hv")

Next, we need to determine the stably expressed genes used as negative control

head(rownames(sce_covid))
#> [1] "LINC00115" "NOC2L"     "KLHL17"    "PLEKHN1"   "HES4"      "ISG15"

We can then load the stably expressed genes data that is coded in gene symbol:

data(segList)
head(segList$human$human_scSEG)
#> [1] "AAR2"  "AATF"  "ABCF3" "ABHD2" "ABT1"  "ACAP2"

We can then run scMerge2 by setting

ctl = segList$human$human_scSEG;
batch = sce_covid$Site;
condition = sce_covid$meta_severity to trigger the pneotype-specific pseudo-replicates identification.

scMerge2_res <- scMerge2(exprsMat = logcounts(sce_covid),
                         batch = sce_covid$Site,
                         ctl = segList$human$human_scSEG,
                         condition = sce_covid$meta_severity,
                         cellTypes = sce_covid$level3,
                         verbose = FALSE)

assay(sce_covid, "scMerge2") <- scMerge2_res$newY

set.seed(2022)
sce_covid <- scater::runPCA(sce_covid, exprs_values = 'scMerge2', subset_row = hvg)           
sce_covid <- scater::runUMAP(sce_covid, dimred = "PCA")    
scater::plotUMAP(sce_covid, colour_by = 'level3', shape = 'Site')

scater::plotUMAP(sce_covid, colour_by = 'meta_severity')

scater::plotUMAP(sce_covid, colour_by = 'Site')

CITE-seq data integration

Next, we will explore the ADT protein abundance data.

altExp(sce_covid, "ADT")
#> class: SummarizedExperiment 
#> dim: 192 3000 
#> metadata(0):
#> assays(3): counts logcounts scMerge2
#> rownames(192): AB_CD80 AB_CD86 ... AB_LIGHT AB_DR3
#> rowData names(0):
#> colnames(3000): BGCV03_GCGAGAAGTTTGGCGC-1 CCGTACTGTGTCAATC-MH8919332
#>   ... CTACATTAGATCCCGC-MH8919331 BGCV01_TCAGCTCAGGCTCAGA-1
#> colData names(0):

This CITE-seq data has measure 192 proteins. We can see that similar to scRNA-seq data, the protein abundance data also displays the batch effect between the two sites.

set.seed(2022)
sce_covid <- scater::runPCA(sce_covid, exprs_values = 'logcounts', 
                      altexp = "ADT", name = "PCA_ADT")           
sce_covid <- scater::runUMAP(sce_covid, dimred = "PCA_ADT", name = "UMAP_ADT")    
scater::plotReducedDim(sce_covid, colour_by = 'level3', 
                       shape = 'Site', dimred = "UMAP_ADT")

scater::plotReducedDim(sce_covid, colour_by = 'meta_severity', dimred = "UMAP_ADT")

scater::plotReducedDim(sce_covid, colour_by = 'Site', dimred = "UMAP_ADT")

We will try the following settings of scMerge2 to integrate the protein abundance

We will use all proteins as negative control ctl = rownames(altExp(sce_covid, "ADT")) (we find that for data with small number of features, using all features as negative control usually works well);
We will start from a smaller number of unwanted varaition as the number of features in total is only 192: ruvK = 10.

scMerge2_res_adt <- scMerge2(exprsMat = assay(altExp(sce_covid, "ADT"), "logcounts"),
                             batch = sce_covid$Site,
                             ctl = rownames(altExp(sce_covid, "ADT")),
                             condition = sce_covid$meta_severity,
                             ruvK = 10,
                             verbose = FALSE)
#> Warning in (function (A, nv = 5, nu = nv, maxit = 1000, work = nv + 7, reorth =
#> TRUE, : You're computing too large a percentage of total singular values, use a
#> standard svd instead.
#> Warning in (function (A, nv = 5, nu = nv, maxit = 1000, work = nv + 7, reorth =
#> TRUE, : You're computing too large a percentage of total singular values, use a
#> standard svd instead.
assay(altExp(sce_covid, "ADT"), "scMerge2") <- scMerge2_res_adt$newY

set.seed(2022)
sce_covid <- scater::runPCA(sce_covid, exprs_values = 'scMerge2', 
                      altexp = "ADT", name = "PCA_ADT")           
sce_covid <- scater::runUMAP(sce_covid, dimred = "PCA_ADT", name = "UMAP_ADT")    
scater::plotReducedDim(sce_covid, colour_by = 'level3', 
                       shape = 'Site', dimred = "UMAP_ADT")

scater::plotReducedDim(sce_covid, colour_by = 'meta_severity', dimred = "UMAP_ADT")

scater::plotReducedDim(sce_covid, colour_by = 'Site', dimred = "UMAP_ADT")

Session Info

sessionInfo()
#> R version 4.3.1 (2023-06-16)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 22.04.2 LTS
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so;  LAPACK version 3.10.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats4    stats     graphics  grDevices utils     datasets  methods  
#> [8] base     
#> 
#> other attached packages:
#>  [1] scMergeBioc2023_1.0.0       ggpubr_0.6.0               
#>  [3] batchelor_1.17.2            DelayedMatrixStats_1.23.0  
#>  [5] DelayedArray_0.27.10        SparseArray_1.1.11         
#>  [7] S4Arrays_1.1.5              abind_1.4-5                
#>  [9] Matrix_1.6-0                scran_1.29.0               
#> [11] scater_1.29.3               ggplot2_3.4.2              
#> [13] scuttle_1.11.2              scMerge_1.17.0             
#> [15] SingleCellExperiment_1.23.0 SummarizedExperiment_1.31.1
#> [17] Biobase_2.61.0              GenomicRanges_1.53.1       
#> [19] GenomeInfoDb_1.37.2         IRanges_2.35.2             
#> [21] S4Vectors_0.39.1            BiocGenerics_0.47.0        
#> [23] MatrixGenerics_1.13.1       matrixStats_1.0.0          
#> 
#> loaded via a namespace (and not attached):
#>   [1] splines_4.3.1            bitops_1.0-7             tibble_3.2.1            
#>   [4] rpart_4.1.19             lifecycle_1.0.3          rstatix_0.7.2           
#>   [7] edgeR_3.43.8             rprojroot_2.0.3          StanHeaders_2.26.27     
#>  [10] processx_3.8.2           lattice_0.21-8           MASS_7.3-60             
#>  [13] backports_1.4.1          magrittr_2.0.3           limma_3.57.7            
#>  [16] Hmisc_5.1-0              sass_0.4.7               rmarkdown_2.23          
#>  [19] jquerylib_0.1.4          yaml_2.3.7               metapod_1.9.0           
#>  [22] pkgbuild_1.4.2           cowplot_1.1.1            RColorBrewer_1.1-3      
#>  [25] ResidualMatrix_1.11.0    zlibbioc_1.47.0          sfsmisc_1.1-15          
#>  [28] purrr_1.0.1              RCurl_1.98-1.12          nnet_7.3-19             
#>  [31] GenomeInfoDbData_1.2.10  ggrepel_0.9.3            inline_0.3.19           
#>  [34] densEstBayes_1.0-2.2     irlba_2.3.5.1            cvTools_0.3.2           
#>  [37] dqrng_0.3.0              pkgdown_2.0.7            codetools_0.2-19        
#>  [40] tidyselect_1.2.0         farver_2.1.1             ScaledMatrix_1.9.1      
#>  [43] viridis_0.6.4            base64enc_0.1-3          jsonlite_1.8.7          
#>  [46] BiocNeighbors_1.19.0     Formula_1.2-5            systemfonts_1.0.4       
#>  [49] bbmle_1.0.25             tools_4.3.1              startupmsg_0.9.6        
#>  [52] ragg_1.2.5               Rcpp_1.0.11              glue_1.6.2              
#>  [55] gridExtra_2.3            xfun_0.39                mgcv_1.9-0              
#>  [58] dplyr_1.1.2              loo_2.6.0                withr_2.5.0             
#>  [61] numDeriv_2016.8-1.1      fastmap_1.1.1            bluster_1.11.4          
#>  [64] fansi_1.0.4              callr_3.7.3              caTools_1.18.2          
#>  [67] digest_0.6.33            rsvd_1.0.5               R6_2.5.1                
#>  [70] textshaping_0.3.6        colorspace_2.1-0         reldist_1.7-2           
#>  [73] gtools_3.9.4             utf8_1.2.3               tidyr_1.3.0             
#>  [76] generics_0.1.3           data.table_1.14.8        FNN_1.1.3.2             
#>  [79] robustbase_0.99-0        prettyunits_1.1.1        htmlwidgets_1.6.2       
#>  [82] uwot_0.1.16              pkgconfig_2.0.3          gtable_0.3.3            
#>  [85] XVector_0.41.1           htmltools_0.5.5          carData_3.0-5           
#>  [88] ruv_0.9.7.1              scales_1.2.1             knitr_1.43              
#>  [91] rstudioapi_0.15.0        checkmate_2.2.0          nlme_3.1-162            
#>  [94] bdsmatrix_1.3-6          M3Drop_1.27.0            cachem_1.0.8            
#>  [97] stringr_1.5.0            KernSmooth_2.23-22       parallel_4.3.1          
#> [100] vipor_0.4.5              foreign_0.8-84           desc_1.4.2              
#> [103] pillar_1.9.0             grid_4.3.1               proxyC_0.3.3            
#> [106] vctrs_0.6.3              gplots_3.1.3             BiocSingular_1.17.1     
#> [109] car_3.1-2                beachmat_2.17.14         cluster_2.1.4           
#> [112] beeswarm_0.4.0           htmlTable_2.4.1          evaluate_0.21           
#> [115] mvtnorm_1.2-2            cli_3.6.1                locfit_1.5-9.8          
#> [118] compiler_4.3.1           rlang_1.1.1              crayon_1.5.2            
#> [121] rstantools_2.3.1.1       ggsignif_0.6.4           labeling_0.4.2          
#> [124] ps_1.7.5                 fs_1.6.3                 ggbeeswarm_0.7.2        
#> [127] stringi_1.7.12           rstan_2.21.8             viridisLite_0.4.2       
#> [130] BiocParallel_1.35.3      munsell_0.5.0            sparseMatrixStats_1.13.0
#> [133] statmod_1.5.0            highr_0.10               igraph_1.5.0.1          
#> [136] broom_1.0.5              memoise_2.0.1            RcppParallel_5.1.7      
#> [139] bslib_0.5.0              DEoptimR_1.1-0           distr_2.9.2

yingxinlin.res@gmail.com ↩︎
Yale University↩︎

Yingxin Lin1