Generation of in silico studies

Published

August 21, 2023

Description of simulation model

  • disease and endpoint: Random names for the quantity of interest (in reality something like ‘HbA1c’, ‘Survival after therapy with imatinib’ etc.)
    • 3 letters in total (2 for disease, 1 for endpoint); duplicates across studies are allowed
  • assumed_delta: Assumed difference in means that was used for designing the study, drawn from a standard lognormal distribution
  • actual_delta: Ground-truth difference in means: 0 in 50% of the cases, otherwise assumed_delta multiplied by a random number uniformly distributed between 0.2 and 1.3
  • sd: Standard deviation of the quantity of interest, a hopefully reasonable random number (assumed_delta multiplied by 1 plus a lognormal random number)
  • n_planned: Planned size of each group, derived from assumed_delta and sd with stats::power.t.test() for a power of 0.8
  • n_per_year_assumed: Assumed mean number of patients that can be included in the study per year
  • n_per_year_actual: Actual number of patients that can be included per year (n_per_year_assumed multiplied by a random number uniformly distributed between 0.5 and 1.2)
  • expected_study_duration and actual_study_duration: Planned group size divided by the assumed / actual number of patients per year
  • actual_power: Power calculated from actual_delta, sd and n_planned
  • null_rejected: Boolean variable from a Bernoulli trial with success probability p = actual_power
  • utility_null_rejected: Utility of the study if the null hypothesis is rejected, a uniformly distributed random number between 0 and 1
  • utility_null_not_rejected: utility_null_rejected / 2
  • cost_per_patient: Monetary cost to include a patient in the study - random number on arbitrary scale
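
To make the design quantities above concrete, the following minimal sketch works through a single hypothetical study. The values assumed_delta = 1, sd_outcome = 2 and actual_delta = 0.6 are illustrative assumptions and are not taken from the simulation; only stats::power.t.test() is the function named in the list.

```r
# Hypothetical single study (illustrative values, not from the simulation)
assumed_delta <- 1
sd_outcome    <- 2
actual_delta  <- 0.6

# Planned size per group for 80% power at the assumed effect
n_planned <- ceiling(
  power.t.test(delta = assumed_delta, sd = sd_outcome, power = 0.8)$n
)

# Power actually achieved if the true effect is smaller than assumed
power_alt <- power.t.test(delta = actual_delta, sd = sd_outcome,
                          n = n_planned)$power

# Power when there is no true effect at all
power_null <- power.t.test(delta = 0, sd = sd_outcome, n = n_planned)$power

c(n_planned = n_planned, power_alt = power_alt, power_null = power_null)
```

With the default strict = FALSE, power.t.test() counts rejections in one tail only, so power_null comes out at about 0.025 rather than 0.05. This is consistent with the actual_power values of 0.02 to 0.03 reported for studies with actual_delta = 0.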

Simulation results

# load packages:
library(dplyr)  # tibble(), sample_n()
library(knitr)  # kable()

# set parameters:
set.seed(87)
nr_of_studies <- 1e4

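# utility functions:

# random lowercase names; letters are drawn without replacement within a name,
# while names may repeat across studies
get_random_names <- function(n, length = 3) {
  replicate(n = n, paste0(sample(letters, size = length), collapse = ''))
}

# planned size per group for a two-sample t-test with the requested power
get_n_planned <- Vectorize(function(delta, sd, power) {
  power.t.test(delta = delta, sd = sd, power = power)$n |> 
    ceiling()
})

# power of a two-sample t-test with n patients per group
get_actual_power <- Vectorize(function(delta, sd, n) {
  power.t.test(delta = delta, sd = sd, n = n)$power
})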


# simulate studies:
studies <- tibble(disease = get_random_names(n = nr_of_studies, length = 2),
                  endpoint = get_random_names(n = nr_of_studies, length = 1),
                  assumed_delta = rlnorm(nr_of_studies),
                  actual_delta = ifelse(runif(nr_of_studies) > 0.5, 
                                        0, 
                                        assumed_delta * runif(nr_of_studies, 0.2, 1.3)),
                  sd = assumed_delta * (1 + rlnorm(nr_of_studies)),
                  n_planned = 
                    get_n_planned(delta = assumed_delta, 
                                  sd = sd, 
                                  power = 0.8),
                  n_per_year_assumed = 10 + rlnorm(nr_of_studies),
                  n_per_year_actual = 
                    n_per_year_assumed * runif(nr_of_studies, 0.5, 1.2),
                  expected_study_duration = n_planned / n_per_year_assumed,
                  actual_study_duration = n_planned / n_per_year_actual,
                  actual_power = get_actual_power(delta = actual_delta, 
                                                  sd = sd, 
                                                  n = n_planned),
                  null_rejected = runif(nr_of_studies) < actual_power,
                  utility_null_rejected = runif(nr_of_studies),
                  utility_null_not_rejected = utility_null_rejected / 2,
                  cost_per_patient = rlnorm(nr_of_studies))
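
The null_rejected column relies on the fact that comparing a uniform random number with a probability p yields a Bernoulli(p) draw. A small self-contained check of that idea follows; the three power values, the number of replications and the seed are made-up assumptions for illustration.

```r
set.seed(1)  # arbitrary seed for this illustration

# Hypothetical power values for three studies
power_values <- c(0.025, 0.4, 0.8)
n_rep <- 1e4

# Each column is one replication, each row is one study
rejected <- replicate(n_rep, runif(length(power_values)) < power_values)

# Empirical rejection rates should be close to the power values
rejection_rate <- rowMeans(rejected)
round(rejection_rate, 2)
```

The same draws could presumably be written as rbinom(n, size = 1, prob = p) == 1, which has the same distribution.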

Analysis

20 randomly chosen studies:

sample_n(studies, 20) |>
  kable(align = 'c', digits = 2)
| disease | endpoint | assumed_delta | actual_delta | sd | n_planned | n_per_year_assumed | n_per_year_actual | expected_study_duration | actual_study_duration | actual_power | null_rejected | utility_null_rejected | utility_null_not_rejected | cost_per_patient |
|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|
| et | z | 1.63 | 0.00 | 5.15 | 159 | 11.59 | 9.02 | 13.72 | 17.63 | 0.03 | FALSE | 0.51 | 0.25 | 2.07 |
| rc | i | 0.42 | 0.53 | 0.60 | 33 | 11.74 | 8.54 | 2.81 | 3.86 | 0.94 | TRUE | 0.36 | 0.18 | 3.63 |
| zf | p | 1.14 | 1.22 | 1.80 | 40 | 10.67 | 7.60 | 3.75 | 5.27 | 0.85 | FALSE | 0.95 | 0.47 | 0.34 |
| ok | y | 0.26 | 0.16 | 0.32 | 26 | 12.30 | 10.05 | 2.11 | 2.59 | 0.42 | FALSE | 0.89 | 0.45 | 3.49 |
| ny | r | 1.50 | 0.00 | 4.34 | 133 | 11.60 | 13.08 | 11.47 | 10.17 | 0.03 | FALSE | 0.39 | 0.19 | 0.33 |
| cp | m | 1.62 | 1.72 | 5.90 | 211 | 11.06 | 8.24 | 19.08 | 25.59 | 0.85 | TRUE | 0.49 | 0.24 | 1.25 |
| rw | b | 13.74 | 9.18 | 31.73 | 85 | 22.36 | 25.38 | 3.80 | 3.35 | 0.47 | FALSE | 0.04 | 0.02 | 0.21 |
| wt | q | 1.96 | 1.64 | 2.68 | 31 | 10.79 | 10.16 | 2.87 | 3.05 | 0.66 | FALSE | 0.66 | 0.33 | 1.02 |
| is | t | 1.22 | 1.30 | 2.54 | 70 | 11.28 | 12.09 | 6.21 | 5.79 | 0.85 | TRUE | 0.52 | 0.26 | 0.23 |
| dt | z | 0.55 | 0.00 | 0.92 | 46 | 10.62 | 11.70 | 4.33 | 3.93 | 0.03 | FALSE | 0.94 | 0.47 | 0.83 |
| gj | h | 1.14 | 0.00 | 1.50 | 29 | 10.19 | 10.29 | 2.85 | 2.82 | 0.02 | FALSE | 0.09 | 0.05 | 1.04 |
| ux | h | 0.56 | 0.46 | 0.91 | 43 | 11.44 | 7.19 | 3.76 | 5.98 | 0.65 | FALSE | 0.87 | 0.44 | 1.62 |
| gz | k | 0.76 | 0.00 | 1.13 | 37 | 13.06 | 10.68 | 2.83 | 3.46 | 0.02 | FALSE | 0.41 | 0.20 | 0.12 |
| bl | h | 0.31 | 0.30 | 1.04 | 180 | 11.01 | 5.54 | 16.35 | 32.50 | 0.79 | TRUE | 0.96 | 0.48 | 1.18 |
| nl | g | 1.19 | 1.48 | 1.33 | 21 | 10.67 | 9.62 | 1.97 | 2.18 | 0.94 | TRUE | 0.07 | 0.04 | 1.80 |
| zx | w | 3.37 | 3.08 | 5.84 | 49 | 10.12 | 11.16 | 4.84 | 4.39 | 0.73 | TRUE | 0.13 | 0.07 | 0.60 |
| kv | g | 3.54 | 0.00 | 5.95 | 46 | 11.21 | 11.80 | 4.10 | 3.90 | 0.03 | FALSE | 0.41 | 0.20 | 0.19 |
| tn | c | 1.23 | 1.14 | 1.96 | 41 | 12.77 | 14.84 | 3.21 | 2.76 | 0.74 | TRUE | 0.73 | 0.37 | 0.97 |
| yk | m | 0.41 | 0.00 | 0.63 | 37 | 11.18 | 6.38 | 3.31 | 5.80 | 0.02 | FALSE | 0.10 | 0.05 | 1.05 |
| ud | k | 0.13 | 0.08 | 0.21 | 39 | 11.13 | 11.38 | 3.50 | 3.43 | 0.38 | FALSE | 0.19 | 0.09 | 1.43 |

Descriptive statistics

skimr::skim(studies)
Data summary
Name studies
Number of rows 10000
Number of columns 15
_______________________
Column type frequency:
character 2
logical 1
numeric 12
________________________
Group variables None

Variable type: character

| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|:-|-:|-:|-:|-:|-:|-:|-:|
| disease | 0 | 1 | 2 | 2 | 0 | 650 | 0 |
| endpoint | 0 | 1 | 1 | 1 | 0 | 26 | 0 |

Variable type: logical

| skim_variable | n_missing | complete_rate | mean | count |
|:-|-:|-:|-:|:-|
| null_rejected | 0 | 1 | 0.29 | FAL: 7141, TRU: 2859 |

Variable type: numeric

| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|:-|-:|-:|-:|-:|-:|-:|-:|-:|-:|:-|
| assumed_delta | 0 | 1 | 1.64 | 2.05 | 0.02 | 0.51 | 1.01 | 1.98 | 31.39 | ▇▁▁▁▁ |
| actual_delta | 0 | 1 | 0.63 | 1.46 | 0.00 | 0.00 | 0.01 | 0.68 | 33.86 | ▇▁▁▁▁ |
| sd | 0 | 1 | 4.33 | 7.89 | 0.04 | 1.04 | 2.20 | 4.66 | 266.44 | ▇▁▁▁▁ |
| n_planned | 0 | 1 | 179.94 | 673.96 | 18.00 | 37.00 | 64.00 | 138.00 | 26294.00 | ▇▁▁▁▁ |
| n_per_year_assumed | 0 | 1 | 11.68 | 2.49 | 10.02 | 10.51 | 11.00 | 11.96 | 91.31 | ▇▁▁▁▁ |
| n_per_year_actual | 0 | 1 | 9.95 | 3.27 | 5.09 | 7.68 | 9.67 | 11.72 | 100.92 | ▇▁▁▁▁ |
| expected_study_duration | 0 | 1 | 15.84 | 60.65 | 0.54 | 3.25 | 5.58 | 11.95 | 2255.34 | ▇▁▁▁▁ |
| actual_study_duration | 0 | 1 | 19.59 | 72.64 | 0.52 | 3.94 | 6.78 | 14.82 | 2827.49 | ▇▁▁▁▁ |
| actual_power | 0 | 1 | 0.28 | 0.33 | 0.02 | 0.03 | 0.08 | 0.56 | 0.96 | ▇▁▁▁▂ |
| utility_null_rejected | 0 | 1 | 0.50 | 0.29 | 0.00 | 0.25 | 0.50 | 0.75 | 1.00 | ▇▇▇▇▇ |
| utility_null_not_rejected | 0 | 1 | 0.25 | 0.14 | 0.00 | 0.12 | 0.25 | 0.37 | 0.50 | ▇▇▇▇▇ |
| cost_per_patient | 0 | 1 | 1.62 | 2.04 | 0.02 | 0.50 | 0.99 | 1.93 | 35.66 | ▇▁▁▁▁ |

Additional aspects:

  • expected_study_duration and actual_study_duration do not have very realistic values yet: expected_study_duration is very long for some studies, so maybe not all of the considered studies would have been feasible due to insufficient patient numbers
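
One possible way to act on this point would be to flag studies whose expected duration exceeds some cap. The sketch below uses four rows copied from the table of 20 randomly chosen studies; the cap max_duration_years = 10 and the column name feasible are hypothetical choices for illustration, not part of the simulation.

```r
# Four studies taken from the sample table (n_planned, n_per_year_assumed)
example_studies <- data.frame(
  disease            = c("et", "rc", "cp", "bl"),
  n_planned          = c(159, 33, 211, 180),
  n_per_year_assumed = c(11.59, 11.74, 11.06, 11.01),
  stringsAsFactors   = FALSE
)

# Hypothetical cap on an acceptable study duration (assumption)
max_duration_years <- 10

# Same duration formula as in the simulation
example_studies$expected_study_duration <-
  example_studies$n_planned / example_studies$n_per_year_assumed

# Flag studies that would finish within the cap
example_studies$feasible <-
  example_studies$expected_study_duration <= max_duration_years

example_studies
```

In this small example only the study rc stays below the cap; how strict such a cap should be is a modelling decision left open here.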