sccomp - Tests differences in cell type proportions and variability from single-cell data
================
Cellular omics such as single-cell genomics, proteomics, and microbiomics allow the characterization of tissue and microbial community composition, which can be compared between conditions to identify biological drivers. This strategy has been critical to unveiling markers of disease progression in conditions such as cancer and pathogen infections.
For cellular omic data, no method for differential variability analysis exists, and methods for differential composition analysis only take a few fundamental data properties into account. Here we introduce sccomp, a generalised method for differential composition and variability analyses capable of jointly modelling data count distribution, compositionality, group-specific variability, and proportion mean-variability association, while being robust to outliers.
sccomp is an extensive analysis framework that allows realistic data simulation and cross-study knowledge transfer. We demonstrate that mean-variability association is ubiquitous across technologies, highlighting the inadequacy of the very popular Dirichlet-multinomial modeling and providing essential principles for differential variability analysis.
Mangiola, Stefano, Alexandra J. Roth-Schulze, Marie Trussart, Enrique Zozaya-Valdés, Mengyao Ma, Zijie Gao, Alan F. Rubin, Terence P. Speed, Heejung Shim, and Anthony T. Papenfuss. 2023. “Sccomp: Robust Differential Composition and Variability Analysis for Single-Cell Data.” Proceedings of the National Academy of Sciences of the United States of America 120 (33): e2203828120. https://doi.org/10.1073/pnas.2203828120
`sccomp` tests differences in cell type proportions from single-cell data. It is robust against outliers, models continuous and discrete factors, and is capable of random-effect/intercept modelling.
`sccomp` is based on `cmdstanr`, which provides the latest version of `cmdstan`, the Bayesian modelling tool. `cmdstanr` is not on CRAN, so installation is a simple 3-step process (the user will be prompted if a step was forgotten):

1. Installation of `sccomp`
2. Installation of `cmdstanr`
3. `cmdstanr` call to `cmdstan` installation

Bioconductor
if (!requireNamespace("BiocManager")) install.packages("BiocManager")
# Step 1
BiocManager::install("sccomp")
# Step 2
install.packages("cmdstanr", repos = c("https://stan-dev.r-universe.dev/", getOption("repos")))
# Step 3
cmdstanr::check_cmdstan_toolchain(fix = TRUE) # Just checking system setting
cmdstanr::install_cmdstan()
GitHub
# Step 1
devtools::install_github("MangiolaLaboratory/sccomp")
# Step 2
install.packages("cmdstanr", repos = c("https://stan-dev.r-universe.dev/", getOption("repos")))
# Step 3
cmdstanr::check_cmdstan_toolchain(fix = TRUE) # Just checking system setting
cmdstanr::install_cmdstan()
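The three steps can be checked from R before fitting any model. The sketch below uses a hypothetical helper, `check_sccomp_setup()`, which is not part of `sccomp`; it assumes that `cmdstanr::cmdstan_version(error_on_NA = FALSE)` returns `NULL` when no CmdStan installation is found, and it installs nothing.

```r
# Hypothetical helper (not part of sccomp): report which of the three
# installation steps appear to be complete on this machine.
check_sccomp_setup <- function() {
  has_sccomp   <- requireNamespace("sccomp", quietly = TRUE)
  has_cmdstanr <- requireNamespace("cmdstanr", quietly = TRUE)

  # Only query CmdStan if cmdstanr itself is available.
  has_cmdstan <- FALSE
  if (has_cmdstanr) {
    version <- tryCatch(
      cmdstanr::cmdstan_version(error_on_NA = FALSE),
      error = function(e) NULL
    )
    has_cmdstan <- !is.null(version)
  }

  c(sccomp = has_sccomp, cmdstanr = has_cmdstanr, cmdstan = has_cmdstan)
}

setup_status <- check_sccomp_setup()
print(setup_status)
```

Any `FALSE` entry points to the step that still needs to be run.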
| Function | Description |
|---|---|
| `sccomp_estimate` | Fit the model onto the data, and estimate the coefficients |
| `sccomp_remove_outliers` | Identify outliers probabilistically based on the model fit, and exclude them from the estimation |
| `sccomp_test` | Calculate the probability that the coefficients are outside the H0 interval (i.e. test_composition_above_logit_fold_change) |
| `sccomp_replicate` | Simulate data from the model, or part of the model |
| `sccomp_predict` | Predict proportions, based on the model, or part of the model |
| `sccomp_remove_unwanted_variation` | Remove the variability for unwanted factors |
| `plot` | Plot summary plots to assess significance |
library(dplyr)
library(sccomp)
library(ggplot2)
library(forcats)
library(tidyr)
data("seurat_obj")
data("sce_obj")
data("counts_obj")
`sccomp` can model changes in composition and variability. By default, the formula for variability is either `~ 1`, which assumes that the cell-group variability is independent of any covariate, or `~ factor_of_interest`, which assumes that the variability depends on the factor of interest only. The variability model must be a subset of the model for composition.
In the output table, the estimate columns start with the prefix `c_`, indicating composition, or `v_`, indicating variability (when `formula_variability` is set).
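The subset requirement on the variability formula can be checked before fitting. The following is a minimal base-R sketch; `is_variability_subset()` is a hypothetical helper, not an `sccomp` function, and it only compares term labels, so it may not match how `sccomp` validates formulas internally.

```r
# Hypothetical helper (not part of sccomp): TRUE when every term of the
# variability formula also appears in the composition formula.
is_variability_subset <- function(formula_composition, formula_variability) {
  terms_c <- attr(terms(formula_composition), "term.labels")
  terms_v <- attr(terms(formula_variability), "term.labels")
  all(terms_v %in% terms_c)
}

is_variability_subset(~ type + continuous_covariate, ~ type)  # TRUE
is_variability_subset(~ type, ~ 1)                            # TRUE (intercept only)
is_variability_subset(~ type, ~ type + continuous_covariate)  # FALSE
```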
sccomp_result =
  sce_obj |>
  sccomp_estimate(
    formula_composition = ~ type,
    .sample = sample,
    .cell_group = cell_group,
    cores = 1
  ) |>
  sccomp_remove_outliers(cores = 1) |> # Optional
  sccomp_test()
sccomp_result =
  counts_obj |>
  sccomp_estimate(
    formula_composition = ~ type,
    .sample = sample,
    .cell_group = cell_group,
    .count = count,
    cores = 1, verbose = FALSE
  ) |>
  sccomp_remove_outliers(cores = 1, verbose = FALSE) |> # Optional
  sccomp_test()
Here you see the results of the fit, the effects of the factor on composition and variability. You also can see the uncertainty around those effects.
The output is a tibble containing the following columns:

- `cell_group` - The cell groups being tested.
- `parameter` - The parameter being estimated from the design matrix described by the input `formula_composition` and `formula_variability`.
- `factor` - The covariate factor in the formula, if applicable (e.g., not present for Intercept or contrasts).
- `c_lower` - Lower (2.5%) quantile of the posterior distribution for a composition (c) parameter.
- `c_effect` - Mean of the posterior distribution for a composition (c) parameter.
- `c_upper` - Upper (97.5%) quantile of the posterior distribution for a composition (c) parameter.
- `c_pH0` - Probability of the null hypothesis (no difference) for a composition (c). This is not a p-value.
- `c_FDR` - False-discovery rate of the null hypothesis for a composition (c).
- `v_lower` - Lower (2.5%) quantile of the posterior distribution for a variability (v) parameter.
- `v_effect` - Mean of the posterior distribution for a variability (v) parameter.
- `v_upper` - Upper (97.5%) quantile of the posterior distribution for a variability (v) parameter.
- `v_pH0` - Probability of the null hypothesis for a variability (v).
- `v_FDR` - False-discovery rate of the null hypothesis for a variability (v).
- `count_data` - Nested input count data.
sccomp_result
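To make the composition columns concrete, the sketch below derives a lower quantile, mean, upper quantile, and null-hypothesis probability from simulated posterior draws. The draws, the threshold of 0.1, and the two-sided H0 interval are assumptions chosen for illustration; the exact calculation performed by `sccomp_test()` may differ.

```r
set.seed(1)
# Simulated posterior draws for one composition coefficient (logit scale).
# In a real analysis these would come from the fitted model.
draws <- rnorm(4000, mean = 0.8, sd = 0.2)

# Assumed minimal effect defining the H0 interval (illustrative value).
threshold <- 0.1

summary_row <- c(
  c_lower  = unname(quantile(draws, 0.025)),
  c_effect = mean(draws),
  c_upper  = unname(quantile(draws, 0.975)),
  # Share of draws that fall inside the H0 interval.
  c_pH0    = mean(abs(draws) <= threshold)
)
round(summary_row, 3)
```

A `c_pH0` close to 0 means that almost all posterior mass lies outside the H0 interval.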
A plot of group proportions, faceted by groups. The blue boxplots represent the posterior predictive check. If the model is descriptively adequate for the data, the blue boxplots should roughly overlay the black boxplots, which represent the observed data. The outliers are coloured in red. A boxplot will be returned for every (discrete) covariate present in formula_composition. The colour coding represents the significant associations for composition and/or variability.
sccomp_result |>
sccomp_boxplot(factor = "type")
A plot of estimates of differential composition (c_) on the x-axis and differential variability (v_) on the y-axis. The error bars represent 95% credible intervals. The dashed lines represent the minimal effect that the hypothesis test is based on. An effect is labelled as significant if it exceeds the minimal effect according to the 95% credible interval. Facets represent the covariates in the model.
sccomp_result |>
plot_1D_intervals()
We can plot the relationship between abundance and variability. As we can see below, they are positively correlated. sccomp models this relationship to obtain a shrinkage effect on the estimates of both the abundance and the variability. This shrinkage is adaptive as it is modelled jointly, thanks to Bayesian inference.
sccomp_result |>
plot_2D_intervals()
You can produce the series of plots by calling the `plot` method.
sccomp_result |> plot()
Note: If counts are available, we strongly discourage the use of proportions, as an important source of uncertainty (i.e., for rare groups/cell types) is not modeled.
The use of proportions is better suited for modelling deconvolution results (e.g., of bulk RNA data), in which case counts are not available.
Proportions should be greater than 0. Assuming that zeros derive from a precision threshold (e.g., deconvolution), zeros are converted to the smallest non-zero value.
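The zero-handling rule can be sketched in base R as follows. `replace_zero_proportions()` is a hypothetical helper written for illustration, not the function `sccomp` uses; it does not renormalise the proportions afterwards, and whether `sccomp` does so is not stated here.

```r
# Hypothetical helper: replace zeros with the smallest non-zero proportion.
replace_zero_proportions <- function(proportion) {
  stopifnot(all(proportion >= 0), any(proportion > 0))
  proportion[proportion == 0] <- min(proportion[proportion > 0])
  proportion
}

proportions <- c(0.50, 0.30, 0.00, 0.15, 0.05, 0.00)
adjusted <- replace_zero_proportions(proportions)
adjusted
# 0.50 0.30 0.05 0.15 0.05 0.05
```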
`sccomp` is able to fit arbitrarily complex models. In this example we have a continuous and a binary covariate.
res =
  seurat_obj |>
  sccomp_estimate(
    formula_composition = ~ type + continuous_covariate,
    .sample = sample, .cell_group = cell_group,
    cores = 1, verbose = FALSE
  )
res
`sccomp` supports multilevel modeling by allowing the inclusion of random effects in the compositional and variability formulas. This is particularly useful when your data has hierarchical or grouped structures, such as measurements nested within subjects, batches, or experimental units. By incorporating random effects, sccomp can account for variability at different levels of your data, improving model fit and inference accuracy.
In this example, we demonstrate how to fit a random intercept model using sccomp. We’ll model the cell-type proportions with both fixed effects (e.g., treatment) and random effects (e.g., subject-specific variability).
Here is the input data
seurat_obj[[]] |> as_tibble()
## # A tibble: 106,297 × 9
##    cell_group nCount_RNA nFeature_RNA group__ group__wrong sample type  group2__
##    <chr>           <dbl>        <int> <chr>   <chr>        <chr>  <chr> <chr>
##  1 CD4 naive           0            0 GROUP1  1            SI-GA… canc… GROUP21
##  2 Mono clas…          0            0 GROUP1  1            SI-GA… canc… GROUP21
##  3 CD4 cm S1…          0            0 GROUP1  1            SI-GA… canc… GROUP21
##  4 B immature          0            0 GROUP1  1            SI-GA… canc… GROUP21
##  5 CD8 naive           0            0 GROUP1  1            SI-GA… canc… GROUP21
##  6 CD4 naive           0            0 GROUP1  1            SI-GA… canc… GROUP21
##  7 Mono clas…          0            0 GROUP1  1            SI-GA… canc… GROUP21
##  8 CD4 cm S1…          0            0 GROUP1  1            SI-GA… canc… GROUP21
##  9 CD4 cm hi…          0            0 GROUP1  1            SI-GA… canc… GROUP21
## 10 B immature          0            0 GROUP1  1            SI-GA… canc… GROUP21
## # ℹ 106,287 more rows
## # ℹ 1 more variable: continuous_covariate <dbl>
res =
  seurat_obj |>
  sccomp_estimate(
    formula_composition = ~ type + (1 | group__),
    .sample = sample,
    .cell_group = cell_group,
    bimodal_mean_variability_association = TRUE,
    cores = 1, verbose = FALSE
  )
res
`sccomp` can model random slopes. We provide an example below.
res =
  seurat_obj |>
  sccomp_estimate(
    formula_composition = ~ type + (type | group__),
    .sample = sample,
    .cell_group = cell_group,
    bimodal_mean_variability_association = TRUE,
    cores = 1, verbose = FALSE
  )
res
If you have a more complex hierarchy, such as measurements nested within subjects and subjects nested within batches, you can include multiple grouping variables. Here `group2__` is nested within `group__`.
res =
  seurat_obj |>
  sccomp_estimate(
    formula_composition = ~ type + (type | group__) + (1 | group2__),
    .sample = sample,
    .cell_group = cell_group,
    bimodal_mean_variability_association = TRUE,
    cores = 1, verbose = FALSE
  )
res
The estimated effects are expressed in the unconstrained space of the parameters, similar to differential expression analysis that expresses changes in terms of log fold change. However, for differences in proportion, logit fold change must be used, which is harder to interpret and understand.
Therefore, we provide a more intuitive proportional fold change that can be more easily understood. However, these cannot be used to infer significance (use sccomp_test() instead), and a lot of care must be taken given the nonlinearity of these measures (a 1-fold increase from 0.0001 to 0.0002 carries a different weight than a 1-fold increase from 0.4 to 0.8).
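The nonlinearity can be seen with a short base-R calculation. It compares the two changes mentioned above, which share the same proportional fold change but differ on the logit scale. This only illustrates the arithmetic and does not call `sccomp`.

```r
# Two changes with the same proportional fold change (a doubling).
from <- c(rare = 0.0001, abundant = 0.4)
to   <- c(rare = 0.0002, abundant = 0.8)

proportional_fold_change <- to / from            # 2 for both
logit_fold_change <- qlogis(to) - qlogis(from)   # differs between the two

data.frame(from, to, proportional_fold_change, logit_fold_change)
```

The logit fold change is about 0.69 for the rare group and about 1.79 for the abundant group, even though both proportions double.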
From your estimates, you can specify which effects you are interested in (this can be a subset of the full model if you wish to exclude unwanted effects), and the two points you would like to compare.
In the case of a categorical variable, the starting and ending points are categories.
sccomp_result |>
  sccomp_proportional_fold_change(
    formula_composition = ~ type,
    from = "healthy",
    to = "cancer"
  ) |>
  select(cell_group, statement)
seurat_obj |>
  sccomp_estimate(
    formula_composition = ~ 0 + type,
    .sample = sample,
    .cell_group = cell_group,
    cores = 1, verbose = FALSE
  ) |>
  sccomp_test(contrasts = c("typecancer - typehealthy", "typehealthy - typecancer"))
Whether a factor improves the fit can be assessed through model comparison with `loo`. In the following example, the model with factor association fits the data better than the baseline model with no factor association. For comparisons, `check_outliers` must be set to FALSE, because leave-one-out must work with the same amount of data, which outlier elimination does not guarantee.
If `elpd_diff` is more than 5 `se_diff` away from zero, we can be confident that one model is better than the other. In this case, -79.9 / 11.5 = -6.9, so we can conclude that model one, the one with factor association, is better than model two.
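The decision rule can be written out as a small calculation. The numbers are the ones quoted above, and the cut-off of 5 standard errors is the rule of thumb stated in this section rather than something enforced by `loo`.

```r
# Values quoted in the text above.
elpd_diff <- -79.9
se_diff   <- 11.5

# Rule of thumb from this section: |elpd_diff| > 5 * se_diff.
ratio <- elpd_diff / se_diff
models_differ <- abs(ratio) > 5

round(ratio, 1)   # -6.9
models_differ     # TRUE
```

With real output, the same ratio can be taken from the `elpd_diff` and `se_diff` columns of the matrix returned by `loo_compare()`.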
library(loo)
# Fit first model
model_with_factor_association =
  seurat_obj |>
  sccomp_estimate(
    formula_composition = ~ type,
    .sample = sample,
    .cell_group = cell_group,
    inference_method = "hmc",
    enable_loo = TRUE
  )
# Fit second model
model_without_association =
  seurat_obj |>
  sccomp_estimate(
    formula_composition = ~ 1,
    .sample = sample,
    .cell_group = cell_group,
    inference_method = "hmc",
    enable_loo = TRUE
  )
# Compare models
loo_compare(
  attr(model_with_factor_association, "fit")$loo(),
  attr(model_without_association, "fit")$loo()
)
We can also model the cell-group variability as dependent on the type, and so test for differences in variability.
res =
  seurat_obj |>
  sccomp_estimate(
    formula_composition = ~ type,
    formula_variability = ~ type,
    .sample = sample,
    .cell_group = cell_group,
    cores = 1, verbose = FALSE
  )
res
Plot 1D significance plot
plots = res |> sccomp_test() |> plot()
plots$credible_intervals_1D
Plot 2D significance plot
Data points are cell groups. Error bars are the 95% credible interval. The dashed lines represent the default threshold fold change for which the probabilities (c_pH0, v_pH0) are calculated. A pH0 of 0 represents the rejection of the null hypothesis that no effect is observed.
This plot is provided only if differential variability has been tested. The differential variability estimates are reliable only if the linear association between mean and variability for `(intercept)` (left-hand side facet) is satisfied. A scatterplot (besides the Intercept) is provided for each category of interest. For each category of interest, the composition and variability effects should be generally uncorrelated.
plots$credible_intervals_2D
When the mean-variability association is bimodal, we recommend setting `bimodal_mean_variability_association = TRUE`. The bimodality of the mean-variability association can be confirmed from plots$credible_intervals_2D (see above).
Otherwise, we recommend setting `bimodal_mean_variability_association = FALSE` (Default).
It is possible to directly evaluate the posterior distribution. In this example, we plot the Monte Carlo chain for the slope parameter of the first cell type. We can see that it has converged and is negative with probability 1.
library(cmdstanr)
library(posterior)
library(bayesplot)
# Assuming res contains the fit object from cmdstanr
fit <- res |> attr("fit")
# Extract draws for 'beta[2,1]'
draws <- as_draws_array(fit$draws("beta[2,1]"))
# Create a traceplot for 'beta[2,1]'
mcmc_trace(draws, pars = "beta[2,1]")
The new tidy framework was introduced in 2024. To understand the differences and improvements compared to the old framework, please read this blog post.