Rforce implements the methodology described in "Rforce: Random Forests for Composite Endpoints", a method for modeling composite endpoints that consist of non-fatal events and terminal events.
The method builds random forests using generalized estimating equations (GEE) and handles dependent censoring caused by terminal events using the concept of pseudo-at-risk duration.
This work received the 2024 Student Paper Competition Award from the American Statistical Association (ASA), awarded jointly by the Section on Statistical Computing and the Section on Statistical Graphics.
The paper is published in Statistics in Medicine:
- PMID: 41640374
- DOI: 10.1002/sim.70413
The software provides both:
- R API
- C API
Key features include:
- High computational and memory efficiency
- Parallel computation using OpenMP
- Reproducible results (see the reproducibility example)
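One way to check the reproducibility claim locally is to train twice with the same seed and compare the two output directories byte for byte. The sketch below is only an illustration: it assumes an `Rforce` binary is on the `PATH`, that the input file names are placeholders for your own data, and that everything written under the `-o` directory is deterministic (no timestamps or timing logs). The helper names `train_argv` and `dir_digest` are hypothetical and not part of Rforce.

```python
import hashlib
from pathlib import Path


def train_argv(out_dir, seed=926, threads=8):
    """Argument list for one `Rforce train` run (input file names are placeholders)."""
    return [
        "Rforce", "train",
        "-d", "design_matrix.csv",
        "-a", "auxiliary_features.csv",
        "-u", "unitsOfCPIU.txt",
        "-o", str(out_dir),
        "-e", str(seed),     # random seed
        "-T", str(threads),  # number of parallel threads
    ]


def dir_digest(directory):
    """SHA-256 over the relative path and contents of every file under `directory`."""
    root = Path(directory)
    digest = hashlib.sha256()
    # Sort so the digest does not depend on directory listing order.
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        digest.update(path.relative_to(root).as_posix().encode("utf-8"))
        digest.update(b"\0")
        digest.update(path.read_bytes())
    return digest.hexdigest()


# Hypothetical usage (requires the Rforce binary and real input files):
#   import subprocess
#   subprocess.run(train_argv("run_a"), check=True)
#   subprocess.run(train_argv("run_b"), check=True)
#   print(dir_digest("run_a") == dir_digest("run_b"))
```

If the digests differ, it is worth checking whether the output directory contains log files with run-specific content before concluding that the models themselves differ.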
Installation
Dependencies
- `cmake >= 3.16.0`: build system for the C API
- `OpenMP`: parallel computing
- `R >= 4.3.3`: R interface
C API Subcommands
Train
Train a composite endpoint forest model.
Options:
| Option | Description | Required/Optional | Default |
|---|---|---|---|
| `-d, --designMatrixY=<str>` | Path to design matrix | Required | |
| `-a, --auxiliary=<str>` | Path to auxiliary features | Required | |
| `-u, --unitsOfCPIU=<str>` | Path to unitsOfCPIU file | Required | |
| `-o, --out=<str>` | Path to output directory | Optional | Current working directory |
| `-v, --verbose=<int>` | Verbosity level (0–3) | Optional | 0 |
| `-m, --maxDepth=<int>` | Maximum tree depth | Optional | 10 |
| `-n, --minNodeSize=<int>` | Minimum node size | Optional | 2 × len(unitsOfCPIU) - 1 |
| `-g, --gain=<float>` | Minimum gain for split | Optional | 0.0 (likelihood-based) or 1.3 (GEE-based) |
| `-t, --mtry=<int>` | Number of variables to try during splitting | Optional | √(number of variables) |
| `-s, --nsplits=<int>` | Number of splits to try per variable | Optional | 10 |
| `-r, --nTrees=<int>` | Number of trees | Optional | 200 |
| `-e, --seed=<int>` | Random seed | Optional | 926 |
| `-p, --nPerms=<int>` | Number of permutations for variable importance | Optional | 10 |
| `-u, --nVars=<int>` | Number of variables in the design matrix | Optional | Number of columns |
| `-i, --pathVarIds=<str>` | Variable IDs (categorical variables supported via repeated IDs) | Optional | |
| `-x, --iDot` | Output tree DOT files | Optional | False |
| `-k, --k=<int>` | Bayesian estimator parameter for leaf output | Optional | 4 |
| `-L, --long` | Use multiple rows per patient (RF-SLAM style) | Optional | |
| `-N, --nopseudo` | Do not estimate pseudo risk time | Optional | |
| `-P, --pseudorisk1` | Use original pseudo-risk time (population level) | Optional | |
| `-B, --pseudorisk2` | Recalculate pseudo-risk time at each tree (default) | Optional | |
| `-D, --dynamicrisk` | Dynamically estimate pseudo-risk time at each split | Optional | |
| `-F, --nophi` | Fix φ = 1, do not estimate φ | Optional | |
| `-P, --phi1` | Estimate φ at population level | Optional | |
| `-H, --phi2` | Estimate φ at tree level (default) | Optional | |
| `-Y, --dynamicphi` | Dynamically estimate φ at each split | Optional | |
| `-G, --gee` | Use GEE approach | Optional | |
| `-A, --padjust=<str>` | p-value adjustment method (bonferroni, holm, hochberg, hommel, BH, BY, none) | Optional | BH |
| `-I, --interaction` | Add interaction terms for GEE | Optional | NULL |
| `-S, --asym` | Use asymptotic approach | Optional | |
| `-T, --threads=<int>` | Number of parallel computing threads | Optional | 8 |
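Three defaults in the table depend on the data or on other flags rather than being fixed numbers. The sketch below spells them out as arithmetic. It is not taken from the Rforce source: the function name `data_dependent_defaults` is hypothetical, `len(unitsOfCPIU)` is assumed to mean the number of entries in the unitsOfCPIU file, and the square root is assumed to be rounded down because the table does not state the rounding.

```python
import math


def data_dependent_defaults(n_vars, n_units, gee=False):
    """Documented defaults that depend on the data or on --gee.

    n_vars  : number of variables in the design matrix
    n_units : number of entries in the unitsOfCPIU file (assumed meaning of len)
    gee     : True when the GEE approach (--gee) is selected
    """
    return {
        # Assumption: floor of the square root, with a minimum of one variable.
        "mtry": max(1, math.isqrt(n_vars)),
        # Read with the usual operator precedence: (2 * n_units) - 1.
        "minNodeSize": 2 * n_units - 1,
        # 0.0 for likelihood-based splitting, 1.3 for GEE-based splitting.
        "gain": 1.3 if gee else 0.0,
    }


# Example: 25 variables, 6 units, GEE-based splitting.
print(data_dependent_defaults(25, 6, gee=True))
```

With 25 variables and 6 units this yields `mtry = 5`, `minNodeSize = 11`, and `gain = 1.3`; passing `-t`, `-n`, or `-g` explicitly overrides the corresponding value.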
Examples
Train a model:
`Rforce train -d design_matrix.csv -a auxiliary_features.csv -u unitsOfCPIU.txt -o output_folder -v 1`
Predict with a trained model:
Notes
- By default, pseudo-risk time and φ (phi) are re-estimated at each tree level.
- Dynamic options (`--dynamicrisk`, `--dynamicphi`) allow estimates at each split for more flexibility.
- Parallel computation is supported via the `--threads` option.
- GEE-based splitting with p-value adjustment is available.
- An R API is under active development. It includes:
  - Classical survival data generation
  - Composite endpoint data generation
  - An implementation of the `Wcompo` methodology
  - An R interface to Rforce
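For readers unfamiliar with the default `BH` setting of `--padjust` mentioned in the notes on GEE-based splitting, the following is a minimal sketch of Benjamini-Hochberg adjusted p-values, using the same definition as R's `p.adjust(p, method = "BH")`. It only illustrates the adjustment itself. How Rforce uses the adjusted values when choosing a split is not described in this section, and the function name `bh_adjust` is hypothetical.

```python
def bh_adjust(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    # Visit p-values from largest to smallest so a running minimum
    # enforces monotonicity of the adjusted values.
    order = sorted(range(m), key=lambda i: p_values[i], reverse=True)
    adjusted = [0.0] * m
    running_min = 1.0
    for pos, idx in enumerate(order):
        rank = m - pos  # 1-based rank of this p-value in ascending order
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Four candidate p-values, e.g. one per variable tried at a split.
print(bh_adjust([0.01, 0.04, 0.03, 0.005]))
```

For these four inputs the adjusted values are approximately 0.02, 0.04, 0.04, and 0.02, so two raw p-values below 0.01 and 0.005 remain the smallest but are both inflated to 0.02.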