simulate_data — simulate

This function simulates counts from a linear model.

simulate_data(
  .data,
  .estimate_object,
  formula_composition,
  formula_variability = NULL,
  .sample = NULL,
  .cell_group = NULL,
  .coefficients = NULL,
  variability_multiplier = 5,
  number_of_draws = 1,
  mcmc_seed = sample(1e+05, 1),
  cores = detectCores()
)

Arguments

.data: A tibble including a cell_group name column, a sample name column, a read counts column, factor columns, a P-value column, and a significance column.
.estimate_object: The result of sccomp_estimate execution. This is used for sampling from real-data properties.
formula_composition: A formula. The formula used to perform the differential cell_group abundance analysis.
formula_variability: A formula. The formula describing the model for differential variability, for example ~treatment. In most cases, if differential variability is of interest, the formula should include only the factor of interest, because a large amount of data is needed to estimate variability for each factor.
.sample: A column name as symbol. The sample identifier
.cell_group: A column name as symbol. The cell_group identifier
.coefficients: The column names for coefficients, for example, c(b_0, b_1)
variability_multiplier: A real scalar. This can be used for artificially increasing the variability of the simulation for benchmarking purposes.
number_of_draws: An integer. How many copies of the data to draw from the model's joint posterior distribution.
mcmc_seed: An integer. Used for Markov-chain Monte Carlo reproducibility. By default a random number is sampled from 1 to 100000. This default can itself be controlled by set.seed().
cores: Integer, the number of cores to be used for parallel calculations.
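
The default for mcmc_seed is the expression sample(1e+05, 1), so it follows R's global random number state. The base-R sketch below evaluates that expression twice under the same set.seed() value to illustrate the point. The value 42 is an arbitrary choice, and the sketch assumes nothing else consumes random numbers between set.seed() and the moment the default is evaluated.

```r
# Evaluate the default seed expression from the usage above twice,
# each time after resetting the global RNG state to the same value.
set.seed(42)  # 42 is an arbitrary illustrative value
seed_a <- sample(1e+05, 1)

set.seed(42)
seed_b <- sample(1e+05, 1)

# The two draws match, so fixing set.seed() fixes the default mcmc_seed
identical(seed_a, seed_b)

# The draw always lies between 1 and 100000
seed_a >= 1 && seed_a <= 1e+05
```

Passing an explicit integer, for example mcmc_seed = 42, avoids relying on the global state altogether.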

Value

A tibble (tbl) with the following columns:

sample - A character column representing the sample name.
type - A factor column representing the type of the sample.
phenotype - A factor column representing the phenotype in the data.
count - An integer column representing the original cell counts.
cell_group - A character column representing the cell group identifier.
b_0 - A numeric column representing the first coefficient used for simulation.
b_1 - A numeric column representing the second coefficient used for simulation.
generated_proportions - A numeric column representing the generated proportions from the simulation.
generated_counts - An integer column representing the generated cell counts from the simulation.
replicate - An integer column representing the replicate number for each draw from the posterior distribution.
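
With number_of_draws greater than 1, the replicate column tells the draws apart, so summaries are usually computed per sample within each replicate. The base-R sketch below shows one way to do this. It uses a hypothetical stand-in data frame with made-up values (not real simulate_data output) that only mimics the column names listed above.

```r
# Hypothetical stand-in for the returned tibble; the values are made up
# purely to illustrate the column layout described above.
sim <- data.frame(
  sample = rep(c("s1", "s2"), each = 2, times = 2),
  cell_group = rep(c("B", "T"), times = 4),
  generated_counts = c(10L, 30L, 25L, 15L, 12L, 28L, 20L, 20L),
  replicate = rep(1:2, each = 4)
)

# Total simulated cells per sample within each posterior draw
totals <- aggregate(generated_counts ~ sample + replicate, data = sim, FUN = sum)

# Proportions recomputed from the counts, per sample and replicate
sim$recomputed_proportions <- sim$generated_counts /
  ave(sim$generated_counts, sim$sample, sim$replicate, FUN = sum)
```

The same grouping applies to real output: group by sample and replicate before comparing draws.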

Examples


message("Use the following example after having installed install.packages(\"cmdstanr\", repos = c(\"https://stan-dev.r-universe.dev/\", getOption(\"repos\")))")
#> Use the following example after having installed install.packages("cmdstanr", repos = c("https://stan-dev.r-universe.dev/", getOption("repos")))

# \donttest{
  if (instantiate::stan_cmdstan_exists()) {
    data("counts_obj")
    library(dplyr)

    estimate = sccomp_estimate(
      counts_obj,
      ~ type, ~1, sample, cell_group, count,
      cores = 1
    )

    # Set coefficients for cell_groups. In this case all coefficients are 0 for simplicity.
    counts_obj = counts_obj |> mutate(b_0 = 0, b_1 = 0)

    # Simulate data
    simulate_data(counts_obj, estimate, ~type, ~1, sample, cell_group, c(b_0, b_1))
  }
# }