Generation of in silico studies

Published

August 21, 2023

Description of simulation model

disease: Random names for the quantity of interest (in reality something like ‘HbA1c’, ‘Survival after therapy with imatinib’ etc.)
- 2 random letters for disease plus 1 for endpoint (3 in total), duplicate names allowed
assumed_delta: Assumed difference in means that was used for designing the study, drawn from a lognormal distribution
actual_delta: Ground truth difference in means, assumed 0 in 50% of the cases, in the rest of the cases assumed_delta multiplied by random number uniformly distributed between 0.2 and 1.3
sd: Standard deviation of the quantity of interest - assumed_delta multiplied by (1 + a lognormal random number), which hopefully gives reasonable values
n_planned: Planned group size for 80% power given assumed_delta and sd, based on stats::power.t.test()
n_per_year_assumed: Assumed mean number of patients that can be included in the study per year
n_per_year_actual: Actual number of patients that can be included per year - n_per_year_assumed multiplied by a random number uniformly distributed between 0.5 and 1.2
expected_study_duration and actual_study_duration: Planned group size divided by assumed / actual number of patients per year
actual_power: Power calculated based on actual_delta, sd and n_planned
null_rejected: Boolean variable based on Bernoulli trial with $p$ taken from actual_power
utility_null_rejected: Utility of study if the null hypothesis is rejected - uniformly distributed random number between 0 and 1
utility_null_not_rejected: utility_null_rejected / 2
cost_per_patient: Monetary cost to include a patient in the study - random number on arbitrary scale
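
To make the chain from assumed_delta to null_rejected concrete, here is a minimal sketch for one hypothetical study. The numeric values are illustrative assumptions and are not taken from the simulation; only stats::power.t.test() comes from the description above.

```r
# Hypothetical single study; values are made up for illustration.
assumed_delta <- 1    # difference in means used at the design stage
actual_delta  <- 0.6  # ground-truth difference (smaller than assumed)
sd            <- 2    # standard deviation of the quantity of interest

# Planned group size for 80% power, rounded up to whole patients
n_planned <- ceiling(power.t.test(delta = assumed_delta, sd = sd, power = 0.8)$n)

# Power actually achieved with that group size under the true effect
actual_power <- power.t.test(delta = actual_delta, sd = sd, n = n_planned)$power

# One Bernoulli draw with p = actual_power decides whether the null is rejected
set.seed(1)
null_rejected <- runif(1) < actual_power
```

Because the true effect is smaller than the assumed one, actual_power falls below the 80% the study was designed for.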

Simulation results

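# load packages:
library(tibble)  # tibble()
library(dplyr)   # sample_n()
library(knitr)   # kable()

# set parameters:
set.seed(87)
nr_of_studies <- 1e4

# utility functions:
# draw n random names, each made of `length` distinct lowercase letters
get_random_names <- function(n, length = 3) {
  replicate(n = n, paste0(sample(letters, size = length), collapse = ''))
}

# group size for a two-sample t-test, rounded up;
# Vectorize() is needed because power.t.test() handles one study at a time
get_n_planned <- Vectorize(function(delta, sd, power) {
  power.t.test(delta = delta, sd = sd, power = power)$n |> 
    ceiling()
})

# power achieved with group size n for a given delta and sd
get_actual_power <- Vectorize(function(delta, sd, n) {
  power.t.test(delta = delta, sd = sd, n = n)$power
})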


# simulate studies:
studies <- tibble(disease = get_random_names(n = nr_of_studies, length = 2),
                  endpoint = get_random_names(n = nr_of_studies, length = 1),
                  assumed_delta = rlnorm(nr_of_studies),
                  actual_delta = ifelse(runif(nr_of_studies) > 0.5, 
                                        0, 
                                        assumed_delta * runif(nr_of_studies, 0.2, 1.3)),
                  sd = assumed_delta * (1 + rlnorm(nr_of_studies)),
                  n_planned = 
                    get_n_planned(delta = assumed_delta, 
                                  sd = sd, 
                                  power = 0.8),
                  n_per_year_assumed = 10 + rlnorm(nr_of_studies),
                  n_per_year_actual = 
                    n_per_year_assumed * runif(nr_of_studies, 0.5, 1.2),
                  expected_study_duration = n_planned / n_per_year_assumed,
                  actual_study_duration = n_planned / n_per_year_actual,
                  actual_power = get_actual_power(delta = actual_delta, 
                                                  sd = sd, 
                                                  n = n_planned),
                  null_rejected = runif(nr_of_studies) < actual_power,
                  utility_null_rejected = runif(nr_of_studies),
                  utility_null_not_rejected = utility_null_rejected / 2,
                  cost_per_patient = rlnorm(nr_of_studies))
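
Because null_rejected is a Bernoulli draw with success probability actual_power, the share of rejected nulls should be close to the mean power. The following sketch checks this on stand-in data; the `power` vector is a hypothetical placeholder, not the simulated actual_power column, and base R is used so it runs without extra packages.

```r
set.seed(1)
n <- 1e5

# Hypothetical per-study power values (assumption for illustration only)
power <- runif(n, min = 0.02, max = 0.96)

# Same construction as null_rejected in the simulation:
# a uniform draw below p is a Bernoulli(p) success
null_rejected <- runif(n) < power

share_rejected <- mean(null_rejected)
mean_power     <- mean(power)

# The two values should agree up to sampling noise
c(share_rejected = share_rejected, mean_power = mean_power)
```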

Analysis

20 randomly chosen studies:

sample_n(studies, 20) |> 
  kable(align = 'c', digits = 2)

disease	endpoint	assumed_delta	actual_delta	sd	n_planned	n_per_year_assumed	n_per_year_actual	expected_study_duration	actual_study_duration	actual_power	null_rejected	utility_null_rejected	utility_null_not_rejected	cost_per_patient
et	z	1.63	0.00	5.15	159	11.59	9.02	13.72	17.63	0.03	FALSE	0.51	0.25	2.07
rc	i	0.42	0.53	0.60	33	11.74	8.54	2.81	3.86	0.94	TRUE	0.36	0.18	3.63
zf	p	1.14	1.22	1.80	40	10.67	7.60	3.75	5.27	0.85	FALSE	0.95	0.47	0.34
ok	y	0.26	0.16	0.32	26	12.30	10.05	2.11	2.59	0.42	FALSE	0.89	0.45	3.49
ny	r	1.50	0.00	4.34	133	11.60	13.08	11.47	10.17	0.03	FALSE	0.39	0.19	0.33
cp	m	1.62	1.72	5.90	211	11.06	8.24	19.08	25.59	0.85	TRUE	0.49	0.24	1.25
rw	b	13.74	9.18	31.73	85	22.36	25.38	3.80	3.35	0.47	FALSE	0.04	0.02	0.21
wt	q	1.96	1.64	2.68	31	10.79	10.16	2.87	3.05	0.66	FALSE	0.66	0.33	1.02
is	t	1.22	1.30	2.54	70	11.28	12.09	6.21	5.79	0.85	TRUE	0.52	0.26	0.23
dt	z	0.55	0.00	0.92	46	10.62	11.70	4.33	3.93	0.03	FALSE	0.94	0.47	0.83
gj	h	1.14	0.00	1.50	29	10.19	10.29	2.85	2.82	0.02	FALSE	0.09	0.05	1.04
ux	h	0.56	0.46	0.91	43	11.44	7.19	3.76	5.98	0.65	FALSE	0.87	0.44	1.62
gz	k	0.76	0.00	1.13	37	13.06	10.68	2.83	3.46	0.02	FALSE	0.41	0.20	0.12
bl	h	0.31	0.30	1.04	180	11.01	5.54	16.35	32.50	0.79	TRUE	0.96	0.48	1.18
nl	g	1.19	1.48	1.33	21	10.67	9.62	1.97	2.18	0.94	TRUE	0.07	0.04	1.80
zx	w	3.37	3.08	5.84	49	10.12	11.16	4.84	4.39	0.73	TRUE	0.13	0.07	0.60
kv	g	3.54	0.00	5.95	46	11.21	11.80	4.10	3.90	0.03	FALSE	0.41	0.20	0.19
tn	c	1.23	1.14	1.96	41	12.77	14.84	3.21	2.76	0.74	TRUE	0.73	0.37	0.97
yk	m	0.41	0.00	0.63	37	11.18	6.38	3.31	5.80	0.02	FALSE	0.10	0.05	1.05
ud	k	0.13	0.08	0.21	39	11.13	11.38	3.50	3.43	0.38	FALSE	0.19	0.09	1.43

Descriptive statistics

skimr::skim(studies)

Data summary
Name	studies
Number of rows	10000
Number of columns	15
_______________________
Column type frequency:
character	2
logical	1
numeric	12
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	empty	n_unique	whitespace
disease	0	1	2	2	0	650	0
endpoint	0	1	1	1	0	26	0

Variable type: logical

skim_variable	n_missing	complete_rate	mean	count
null_rejected	0	1	0.29	FAL: 7141, TRU: 2859

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
assumed_delta	1	1.64	2.05	0.02	0.51	1.01	1.98	31.39	▇▁▁▁▁
actual_delta	1	0.63	1.46	0.00	0.00	0.01	0.68	33.86	▇▁▁▁▁
sd	1	4.33	7.89	0.04	1.04	2.20	4.66	266.44	▇▁▁▁▁
n_planned	1	179.94	673.96	18.00	37.00	64.00	138.00	26294.00	▇▁▁▁▁
n_per_year_assumed	1	11.68	2.49	10.02	10.51	11.00	11.96	91.31	▇▁▁▁▁
n_per_year_actual	1	9.95	3.27	5.09	7.68	9.67	11.72	100.92	▇▁▁▁▁
expected_study_duration	1	15.84	60.65	0.54	3.25	5.58	11.95	2255.34	▇▁▁▁▁
actual_study_duration	1	19.59	72.64	0.52	3.94	6.78	14.82	2827.49	▇▁▁▁▁
actual_power	1	0.28	0.33	0.02	0.03	0.08	0.56	0.96	▇▁▁▁▂
utility_null_rejected	1	0.50	0.29	0.00	0.25	0.50	0.75	1.00	▇▇▇▇▇
utility_null_not_rejected	1	0.25	0.14	0.00	0.12	0.25	0.37	0.50	▇▇▇▇▇
cost_per_patient	1	1.62	2.04	0.02	0.50	0.99	1.93	35.66	▇▁▁▁▁

Additional aspects:

- expected_study_duration and actual_study_duration do not have very realistic values yet → maybe not all of the considered studies would have been feasible due to insufficient patient numbers
- expected_study_duration is very long for some studies (maximum of about 2255 years in the summary above)
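
One possible way to handle the feasibility concern is to drop studies whose expected duration exceeds some limit. The sketch below uses a toy data frame with the same column names as `studies`. The values and the threshold `max_duration_years` are assumptions for illustration and are not part of the original simulation.

```r
# Toy data with the same column names as `studies`; values are made up.
toy_studies <- data.frame(
  disease            = c("ab", "cd", "ef", "gh"),
  n_planned          = c(40, 180, 2600, 64),
  n_per_year_assumed = c(10.5, 11.0, 11.5, 12.0)
)

# Same definition as in the simulation: planned size / assumed recruitment rate
toy_studies$expected_study_duration <-
  toy_studies$n_planned / toy_studies$n_per_year_assumed

# Hypothetical feasibility threshold in years (an assumption)
max_duration_years <- 10

# Keep only studies that could plausibly be completed within the limit
feasible <- subset(toy_studies, expected_study_duration <= max_duration_years)
feasible$disease
```

Applied to the full simulation, such a filter would change the distribution of n_planned and actual_power, so whether to apply it depends on what the in silico studies are meant to represent.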