Rforce
Composite Endpoint data type that has gained momentum in the field, due to its fully consideration on the all observated events for each patients, in comparison to the Competing Risk setting that utilize only on the first observed event for each patient.
A parametric approach, like Wcompo(Mao and Lin 2016), has proposed to addressed the inference challenge while imposing the proportional hazard assumption. Inspired by Random Survival Forest (Ishwaran et al. 2008) and Counting Process Information Unit(CPIU) (Wongvibulsin, Wu, and Zeger 2019), we here porposed a non-parametric ensememble method, Random Froest for Composite Endpoints(Rforce), to further relax the proportional hazard assumption.
Below, we like to demonstrate a simulated data example to show off the simplistic pipeline of Rforce.
Generating Composite Endpoint Data
-
Rforcepackage provides functions to generate composite endpoints survival data with options- Constant baseline hazard
- Non-constant() baseline hazard
- Proportional hazard
- Non-proportional hazard
- see details for these options in
Articles/Generating Composite Endpoints/Survival Data
- By default, we generate a non-constant(Weibull) baseline hazard with proportional hazard assumption.
library(Rforce)
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
data_list <- compo_sim(n_patients = 500, verbose = FALSE)
data_list$true_beta
#> [1] 0.0 0.0 0.0 0.6 0.0 0.0 0.8 0.0 0.7 0.0
data_list$hazard_function
#> function (x)
#> {
#> exp(x %*% true_beta)
#> }
#> <bytecode: 0x5572f7c25970>
#> <environment: 0x5572f7c252e0>
head(data_list$dataset)
#> Id Time Status binary1 binary2 binary3 binary4 binary5 binary6
#> 1 1 0.003487709 2 0 1 0 0 1 0
#> 2 1 0.047510838 2 0 1 0 0 1 0
#> 3 1 1.646181137 1 0 1 0 0 1 0
#> 4 2 0.074786350 2 0 0 1 0 1 0
#> 5 2 0.266915324 2 0 0 1 0 1 0
#> 6 2 0.833510030 2 0 0 1 0 1 0
#> continuous7 continuous8 continuous9 continuous10
#> 1 0.4147548 0.6333024 0.4278975 0.86040861
#> 2 0.4147548 0.6333024 0.4278975 0.86040861
#> 3 0.4147548 0.6333024 0.4278975 0.86040861
#> 4 0.6814544 0.5258878 0.6166269 0.03045993
#> 5 0.6814544 0.5258878 0.6166269 0.03045993
#> 6 0.6814544 0.5258878 0.6166269 0.03045993- Each patient’s observation can span across multiple rows, each row representing one of event of interest, ordered by time of occurrence.
- For completeness, each patient has complete observations up to terminal(fatal) event.
- However,
compo_sim()only generates non time-varying covariates although the proposed methodology can handle time-varying covariates. - Here, we generate 6 binary covariates and 4 continuous covariates by default.
- The hazard function consist of baseline hazard and parametric part , i.e., where and .
- Only two covariates have non-zero effects on the hazard function,
is indicated by
data_list$true_beta.
Implementating Random Censoring
- In real world clinical trials, patients may be censored before experiencing terminal event.
-
Rforcepackage providesrandom_censoring()function to implement random censoring mechanism on the generated complete data.
df <- random_censoring(
data = data_list$dataset,
event_rate = 0.5
)
head(df)
#> Id X Status binary1 binary2 binary3 binary4 binary5 binary6
#> 1 1 0.003487709 2 0 1 0 0 1 0
#> 2 1 0.047510838 2 0 1 0 0 1 0
#> 3 1 1.646181137 1 0 1 0 0 1 0
#> 4 2 0.074786350 2 0 0 1 0 1 0
#> 5 2 0.266915324 2 0 0 1 0 1 0
#> 6 2 0.539628591 0 0 0 1 0 1 0
#> continuous7 continuous8 continuous9 continuous10
#> 1 0.4147548 0.6333024 0.4278975 0.86040861
#> 2 0.4147548 0.6333024 0.4278975 0.86040861
#> 3 0.4147548 0.6333024 0.4278975 0.86040861
#> 4 0.6814544 0.5258878 0.6166269 0.03045993
#> 5 0.6814544 0.5258878 0.6166269 0.03045993
#> 6 0.6814544 0.5258878 0.6166269 0.03045993Step 1: Converting CPIU-wide Format
- Evidently, the hazard function is not a constant of time. Inspired by Counting Process Information Unit(CPIU), we convert the observed data into CPIU-wide format to better capture the hazard pattern for the population.
- The interval of units
intervalsis defined such that each interval contains approximately equal number of events, e.g. 10 intervals. -
patients_to_cpius()is customized for these purpose.
n_intervals <- 10
probs <- seq(0, 1, length.out = n_intervals + 1)
observed_times <- df %>% dplyr::pull(X)
break_points <- stats::quantile(
observed_times,
probs = probs,
na.rm = TRUE
)
break_points[1] <- 0.0 # ensure starting at 0
break_points[n_intervals + 1] <- max(observed_times, na.rm = TRUE) * 1.001 # ensure ending beyond the max time
intervals <- diff(break_points)
cpiu_wide <- patients_to_cpius(
data_to_convert = df,
units_of_cpiu = intervals
)
names(cpiu_wide)
#> [1] "data" "unitsOfCPIUs" "nIntervals"
#> [4] "eventTypes" "eventWeights" "pseudoRisk"
#> [7] "wideFormat" "designMatrix_Y" "auxiliaryFeatures"
#> [10] "nPatients" "variableNameOrignal" "variableSummaryOrignal"
#> [13] "isDummy" "variableUsed" "variableIds"
#> [16] "formula"
class(cpiu_wide)
#> [1] "CPIU"-
CPIUis S3 class defined in Rforce package to better encapsulateCPIU-wideformatted data, along with the metadata. - Two main
data.frame’s are included,-
designMatrix_Y: covariates + number of events in each defined interval -
auxilaryFeatures:Id+pseudo risk timein each defined interval +X+Status
-
Step 2: Encoding Factor/Categorical Variables
In real world, there may have factor or categorical variable in the data. Internally, Rforce encodes those variables into dummy variables, to avoid the collinearlity, the number of encoded dummy varaibles for each categorical
variablealways equal tonlevels(variable) - 1.Worthy to metion, Rforce don’t evaluate the variable importance on individual levels of these categorical variable. Through the carefully design permutation framework, Rforce direct output the variable importance on the variable itself after the model fitting.
By default, all the columns in
designMatrix_Ywill be used in dummy encoding and random forest fitting; One can customize the original columns to be considered by modifying the parametercols_to_keep.
cpiu_wide <- cpius_to_dummy(cpiu_wide, cols_to_keep = NULL)Step 3: Fit Random Forest Model
-
Rforce()does the final model fitting - Internally, there are various of settings for the hyperparameters
for growing random forest with some smart default. For example,
n_treesmtryn_splitsmin_gainmin_node_sizemax_depth-
seed. - see details for the default in
Reference/Rforce().
- Same time, the default
split_ruleis using the proposed quasi-likelihood ratioRforce-QLR, it is more computational efficient than other two proposed rules,-
Rforce-GEE, -
Rforce-GEEInter, - see details for the default in
Reference/Rforce().
-
fit <- Rforce(
cpius = cpiu_wide,
split_rule = "Rforce-QLR"
)
#> Fitting Rforce-QLR forest...3-in-1 Equivalent Command
-
Rforce()is also a bigger wrapper function forStep 1 - 3. -
Rforce()’s result can be reproduced using internalseed. - Meanwhile,
Rforceresult also can be saved and loaded through the designated mechanismssaveRforce()andloadRforce().
fit1 <- Rforce(
data = df,
formula = Surv(Id, X, Status) ~ .,
n_intervals = 10,
split_rule = "Rforce-QLR",
n_trees = 200,
seed = 926
)
#> Converting data to design matrix and auxiliary features...
#> Fitting Rforce-QLR forest...
fit2 <- Rforce(
data = df,
formula = Surv(Id, X, Status) ~ .,
n_intervals = 10,
split_rule = "Rforce-QLR",
n_trees = 200,
seed = 927
)
#> Converting data to design matrix and auxiliary features...
#> Fitting Rforce-QLR forest...
saveRforce(fit1, "./output/fit1/")
saveRforce(fit2, "./output/fit2/")
fit3 <- loadRforce("./output/fit1/")
all.equal(fit1, fit2)
#> [1] "Component \"forestMatrix\": Attributes: < Component \"dim\": Mean relative difference: 0.00999001 >"
#> [2] "Component \"forestMatrix\": Numeric: lengths (108108, 107028) differ"
#> [3] "Component \"predicted\": Mean relative difference: 0.003605287"
#> [4] "Component \"oobPredicted\": Mean relative difference: 0.004326195"
#> [5] "Component \"vimpStat\": Mean relative difference: 1.437804"
#> [6] "Component \"bagMatrix\": Mean relative difference: 1.335842"
#> [7] "Component \"treePhi\": Mean relative difference: 0.1005522"
#> [8] "Component \"seed\": Mean relative difference: 0.001079914"
all.equal(fit1, fit3)
#> [1] TRUEVariable Importance
vimp(fit)
#> binary1 binary2 binary3 binary4 binary5 binary6
#> 0.040946456 0.033170969 0.006243701 1.000000000 0.000000000 0.026226941
#> continuous7 continuous8 continuous9 continuous10
#> 0.205702021 0.013270203 0.296756504 0.023753880Predict
pred <- predict(fit)
head(pred)
#> [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
#> [1,] 3.517020 2.342582 2.286471 1.588793 2.005169 1.097725 1.043538 0.9363139
#> [2,] 3.318098 2.667595 2.658413 1.405451 1.848858 1.677621 1.195883 0.8378943
#> [3,] 3.060574 2.812747 2.736766 2.018025 2.232071 1.571317 1.256591 0.7803827
#> [4,] 2.426469 2.136668 2.354838 1.358918 1.877999 1.039673 1.050990 0.7952271
#> [5,] 2.813537 2.458203 2.369629 2.005202 2.076070 1.089715 1.435935 1.2543244
#> [6,] 3.114309 2.983228 2.666697 2.421175 2.040845 1.445133 1.869097 1.5109275
#> [,9] [,10]
#> [1,] 0.5675916 0.2303011
#> [2,] 0.6706753 0.3305869
#> [3,] 0.8338628 0.2671524
#> [4,] 0.6159991 0.2747665
#> [5,] 0.6701001 0.3064673
#> [6,] 0.7409022 0.3564489Print Tree
printTree(fit, 1)
#> Tree structure saved to /tmp/RtmpaJpj1x/file1cc051db7a86.dot