Estimate a regression Random Forest using spectral deconfounding. The spectrally deconfounded Random Forest (SDForest) combines SDTrees in the same way as the original Random Forest (Breiman, 2001). The idea is to combine multiple regression trees into an ensemble in order to decrease the variance and obtain a smoother function. Ensembles work best if the individual models are independent of each other. Two mechanisms decorrelate the regression trees as much as possible. The first is bagging (Breiman, 1996): each regression tree is trained on an independent bootstrap sample of the observations, i.e., a random sample of size \(n\) drawn with replacement from the observations. The second is that only a random subset of the covariates is available for each split: before each split, we sample \(\text{mtry} \leq p\) of the covariates and choose the split that reduces the loss the most among those only. The forest prediction is the average of the tree predictions: $$\widehat{f(X)} = \frac{1}{N_{tree}} \sum_{t = 1}^{N_{tree}} SDTree_t(X)$$
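The two decorrelation mechanisms and the final averaging can be illustrated with a minimal base R sketch. It is not the package implementation: the weak learner here is a hypothetical one-split stump (`fit_stump`, `predict_stump`) fitted with the plain squared-error loss rather than an SDTree with the spectral loss, and all names are illustrative.

```r
set.seed(1)
n <- 100
p <- 6
X <- matrix(rnorm(n * p), nrow = n)
y <- 3 * sign(X[, 1]) + rnorm(n)

# Hypothetical weak learner: one split at the median of the best of
# `mtry` randomly sampled covariates (plain squared-error loss).
fit_stump <- function(X, y, mtry) {
  candidates <- sample(ncol(X), mtry)  # random covariate subset for this split
  best <- NULL
  for (j in candidates) {
    s <- median(X[, j])
    left <- X[, j] <= s
    loss <- sum((y[left] - mean(y[left]))^2) +
      sum((y[!left] - mean(y[!left]))^2)
    if (is.null(best) || loss < best$loss) {
      best <- list(j = j, s = s, loss = loss,
                   left = mean(y[left]), right = mean(y[!left]))
    }
  }
  best
}

predict_stump <- function(stump, X) {
  ifelse(X[, stump$j] <= stump$s, stump$left, stump$right)
}

n_tree <- 25
mtry <- floor(p / 2)
forest <- lapply(seq_len(n_tree), function(t) {
  idx <- sample(n, n, replace = TRUE)  # bagging: bootstrap sample of size n
  fit_stump(X[idx, , drop = FALSE], y[idx], mtry)
})

# Ensemble prediction: average of the individual tree predictions
pred <- rowMeans(sapply(forest, predict_stump, X = X))
```

Each stump sees a different bootstrap sample and a different covariate subset, so the individual predictions differ and the average is smoother than any single one.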

SDForest(
  formula = NULL,
  data = NULL,
  x = NULL,
  y = NULL,
  nTree = 100,
  cp = 0,
  min_sample = 5,
  mtry = NULL,
  mc.cores = 1,
  Q_type = "trim",
  trim_quantile = 0.5,
  q_hat = 0,
  Qf = NULL,
  A = NULL,
  gamma = 7,
  max_size = NULL,
  gpu = FALSE,
  return_data = TRUE,
  mem_size = 1e+07,
  leave_out_ind = NULL,
  envs = NULL,
  nTree_leave_out = NULL,
  nTree_env = NULL,
  max_candidates = 100,
  Q_scale = TRUE,
  verbose = TRUE
)

Arguments

formula

Object of class formula describing the model to fit, of the form y ~ x1 + x2 + ... where y is a numeric response and x1, x2, ... are vectors of covariates. Interactions are not supported.

data

Training data of class data.frame containing the variables in the model.

x

Matrix of covariates, alternative to formula and data.

y

Vector of responses, alternative to formula and data.

nTree

Number of trees to grow.

cp

Complexity parameter, minimum loss decrease to split a node. A split is only performed if the loss decrease is larger than cp * initial_loss, where initial_loss is the loss of the initial estimate using only a stump.

min_sample

Minimum number of observations per leaf. A split is only performed if both resulting leaves have at least min_sample observations.
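The two stopping rules controlled by cp and min_sample can be written as a single check. The helper `accept_split` below is hypothetical and only restates the conditions described above; it is not a function of the package.

```r
# Hypothetical helper: TRUE if a candidate split passes both rules.
accept_split <- function(loss_before, loss_after, initial_loss, cp,
                         n_left, n_right, min_sample) {
  # the loss decrease must exceed cp * initial_loss
  enough_gain <- (loss_before - loss_after) > cp * initial_loss
  # both resulting leaves need at least min_sample observations
  enough_obs <- n_left >= min_sample && n_right >= min_sample
  enough_gain && enough_obs
}

initial_loss <- 10

# decrease of 0.5 exceeds 0.01 * 10 = 0.1, and both leaves are large enough
ok <- accept_split(4.0, 3.5, initial_loss, cp = 0.01,
                   n_left = 12, n_right = 8, min_sample = 5)

# decrease of 0.05 does not exceed 0.1
too_small_gain <- accept_split(4.0, 3.95, initial_loss, cp = 0.01,
                               n_left = 12, n_right = 8, min_sample = 5)

# the right leaf would have fewer than min_sample observations
too_small_leaf <- accept_split(4.0, 3.5, initial_loss, cp = 0.01,
                               n_left = 17, n_right = 3, min_sample = 5)
```

With the default cp = 0, any split that strictly decreases the loss passes the first rule, so tree size is then limited by min_sample alone.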

mtry

Number of randomly selected covariates to consider for a split. If NULL, half of the covariates are available for each split: \(\text{mtry} = \lfloor \frac{p}{2} \rfloor\)

mc.cores

Number of cores to use for parallel processing. If mc.cores > 1, the trees are estimated in parallel.

Q_type

Type of deconfounding, one of 'trim', 'pca', 'no_deconfounding'. 'trim' corresponds to the Trim transform (Ćevid et al., 2020) as implemented in the doubly debiased lasso (Guo et al., 2022), 'pca' to the PCA transformation (Paul et al., 2008). See get_Q.

trim_quantile

Quantile for Trim transform, only needed for trim, see get_Q.

q_hat

Assumed confounding dimension, only needed for pca, see get_Q.
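To make the roles of Q_type, trim_quantile, and q_hat concrete, the following base R sketch builds both spectral transformations from the singular value decomposition of the covariate matrix, following the usual definitions from the spectral deconfounding literature. It is an illustration under assumptions, not the code of get_Q: scaling of the data (Q_scale) is omitted, and the quantile type used by the package may differ from the default of `quantile`.

```r
set.seed(1)
n <- 20
p <- 50
X <- matrix(rnorm(n * p), nrow = n)

sv <- svd(X)  # X = U diag(d) V^T
U <- sv$u
d <- sv$d

# Trim transform: cap the singular values at the trim_quantile quantile
trim_quantile <- 0.5
tau <- unname(quantile(d, trim_quantile))
d_trim <- pmin(d, tau)
Q_trim <- diag(n) - U %*% diag(1 - d_trim / d) %*% t(U)

# PCA transform: remove the first q_hat principal directions
q_hat <- 2
U_q <- U[, seq_len(q_hat), drop = FALSE]
Q_pca <- diag(n) - U_q %*% t(U_q)

# Singular values of the transformed covariates
d_after_trim <- svd(Q_trim %*% X)$d
d_after_pca <- svd(Q_pca %*% X)$d
```

After the Trim transform, the singular values of the covariates are `pmin(d, tau)`, so the dominant directions, which are the ones most affected by dense confounding, are shrunk to the level of the chosen quantile. The PCA transform instead sets the first q_hat singular values to zero and leaves the rest unchanged.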

Qf

Spectral transformation, if NULL it is internally estimated using get_Q.

A

Numerical Anchor of class matrix. See get_W.

gamma

Strength of distributional robustness, \(\gamma \in [0, \infty]\). See get_W.
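For intuition on how A and gamma interact, the sketch below constructs the anchor transformation in the form commonly used in the anchor regression literature, \(W = I - (1 - \sqrt{\gamma}) \Pi_A\), where \(\Pi_A\) is the projection onto the column space of A. That get_W uses exactly this form is an assumption here; see its documentation for the definition used by the package.

```r
set.seed(1)
n <- 50
A <- matrix(rnorm(n * 2), nrow = n)  # two numerical anchor variables
gamma <- 7

# Projection onto the column space of A
P_A <- A %*% solve(crossprod(A), t(A))

# Anchor transformation (assumed form, see the lead-in)
W <- diag(n) - (1 - sqrt(gamma)) * P_A

# A vector orthogonal to the anchors is left unchanged by W
v <- rnorm(n)
v_orth <- v - P_A %*% v
```

Under this form, the part of a vector explained by the anchors is scaled by `sqrt(gamma)` and the orthogonal part is left unchanged, so gamma = 1 gives the identity and larger values of gamma put more weight on variation across the anchors.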

max_size

Maximum number of observations used for a bootstrap sample. If NULL, n samples are drawn with replacement.

gpu

If TRUE, the calculations are performed on the GPU, provided it is properly set up.

return_data

If TRUE, the training data is returned in the output. This is needed for prune.SDForest, regPath.SDForest, and for mergeForest.

mem_size

Number of split candidates that can be evaluated at once. This is a trade-off between memory and speed; it can be decreased if either the memory is not sufficient or the GPU is too small.

leave_out_ind

Indices of observations that should not be used for training.

envs

Vector of environments of class factor which can be used for stratified tree fitting.

nTree_leave_out

Number of trees that should be estimated while leaving one of the environments out. The forest then contains the number of environments times nTree_leave_out trees.

nTree_env

Number of trees that should be estimated for each environment. The forest then contains the number of environments times nTree_env trees.

max_candidates

Maximum number of split points that are proposed at each node for each covariate.

Q_scale

Should the data be scaled before estimating the spectral transformation? The default is TRUE so that the signal of high-variance covariates is not reduced; we do not know of a scenario where this hurts.

verbose

If TRUE, fitting information is shown.

Value

Object of class SDForest containing:

predictions

Vector of predictions for each observation.

forest

List of SDTree objects.

var_names

Names of the covariates.

oob_loss

Out-of-bag loss, measured as the mean squared error (MSE).

oob_SDloss

Out-of-bag loss using the spectral transformation.

var_importance

Variable importance. The variable importance is calculated as the sum of the decrease in the loss function resulting from all splits that use a covariate for each tree. The mean of the variable importance of all trees results in the variable importance for the forest.
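The aggregation from trees to forest described above is a plain mean over trees. A small sketch with made-up per-tree importances (the matrix `imp_per_tree` is purely illustrative and not an object returned by the package):

```r
# Hypothetical per-tree importances: rows are covariates, columns are trees.
# Each entry is the summed loss decrease of all splits on that covariate.
imp_per_tree <- cbind(
  tree1 = c(x1 = 4, x2 = 0, x3 = 1),
  tree2 = c(x1 = 2, x2 = 1, x3 = 0),
  tree3 = c(x1 = 3, x2 = 2, x3 = 2)
)

# Forest-level importance: mean over trees for each covariate
var_importance <- rowMeans(imp_per_tree)
# x1 = 3, x2 = 1, x3 = 1
```

A covariate that is never used for a split in a tree contributes zero for that tree, as x2 does in the first tree here.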

oob_ind

For each observation, the indices of the trees whose training set did not contain that observation.

oob_predictions

Out-of-bag predictions.
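How oob_ind, oob_predictions, and oob_loss relate can be sketched in base R. The "trees" below are hypothetical placeholders that predict the mean of their bootstrap sample; only the out-of-bag bookkeeping is the point, and the object names mirror the returned components for readability, not because the package computes them this way.

```r
set.seed(1)
n <- 30
n_tree <- 50
y <- rnorm(n)

# Bootstrap indices used to train each tree
boot_idx <- lapply(seq_len(n_tree), function(t) sample(n, n, replace = TRUE))

# Hypothetical "tree": predicts the mean of its bootstrap sample everywhere.
# The result is an n x n_tree matrix of predictions.
tree_pred <- sapply(boot_idx, function(idx) rep(mean(y[idx]), n))

# For each observation, the trees that did not see it during training
oob_ind <- lapply(seq_len(n), function(i) {
  which(!vapply(boot_idx, function(idx) i %in% idx, logical(1)))
})

# Out-of-bag prediction: average over those trees only
oob_predictions <- vapply(seq_len(n), function(i) {
  mean(tree_pred[i, oob_ind[[i]]])
}, numeric(1))

# Out-of-bag loss as mean squared error
oob_loss <- mean((y - oob_predictions)^2)
```

Because each observation is predicted only by trees that never saw it, oob_loss estimates the prediction error without a separate test set. With very few trees, some observations may have no out-of-bag tree at all.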

If return_data is TRUE the following are also returned:

X

Matrix of covariates.

Y

Vector of responses.

Q

Spectral transformation.

If envs is provided the following are also returned:

envs

Vector of environments.

nTree_env

Number of trees for each environment.

ooEnv_ind

For each observation, the indices of the trees whose training set contained neither that observation nor any observation from the same environment.

ooEnv_loss

Out-of-bag loss using only trees that did not contain the observation or the same environment.

ooEnv_SDloss

Out-of-bag loss using the spectral transformation and only trees that did not contain the observation or the same environment.

ooEnv_predictions

Out-of-bag predictions using only trees that did not contain the observation or the same environment.

nTree_leave_out

If environments are left out, the environment that was left out for each tree.

nTree_env

If environments are provided, the environment each tree is trained with.

References

Author

Markus Ulmer

Examples

set.seed(1)
n <- 50
X <- matrix(rnorm(n * 5), nrow = n)
y <- sign(X[, 1]) * 3 + rnorm(n)
model <- SDForest(x = X, y = y, Q_type = 'no_deconfounding', nTree = 5, cp = 0.5)
predict(model, newdata = data.frame(X))
#>         1         2         3         4         5         6         7         8 
#> -1.680850  2.056027 -1.680850  2.056027  2.056027 -1.680850  2.056027  2.056027 
#>         9        10        11        12        13        14        15        16 
#>  2.056027 -1.680850  2.056027  2.056027 -1.680850 -1.680850  2.056027 -1.680850 
#>        17        18        19        20        21        22        23        24 
#> -1.680850  2.056027  2.056027  2.056027  2.056027  2.056027  2.056027 -1.680850 
#>        25        26        27        28        29        30        31        32 
#>  2.056027 -1.680850 -1.680850 -1.680850 -1.680850  2.056027  2.056027 -1.680850 
#>        33        34        35        36        37        38        39        40 
#>  2.056027 -1.680850 -1.680850 -1.680850 -1.680850 -1.680850  2.056027  2.056027 
#>        41        42        43        44        45        46        47        48 
#> -1.680850 -1.680850  2.056027  2.056027 -1.680850 -1.680850  2.056027  2.056027 
#>        49        50 
#> -1.680850  2.056027 

# \donttest{
set.seed(42)
# simulation of confounded data
sim_data <- simulate_data_nonlinear(q = 2, p = 150, n = 100, m = 2)
X <- sim_data$X
Y <- sim_data$Y
train_data <- data.frame(X, Y)
# causal parents of y
sim_data$j
#> [1]  88 112

# comparison to classical random forest
fit_ranger <- ranger::ranger(Y ~ ., train_data, importance = 'impurity')

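# matrix interface with the PCA transformation
fit <- SDForest(x = X, y = Y, nTree = 10, Q_type = 'pca', q_hat = 2)
# formula interface with the default Trim transformation
fit <- SDForest(Y ~ ., data = train_data, nTree = 10)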
fit
#> SDForest result
#> 
#> Number of trees:  10 
#> Number of covariates:  150 
#> OOB loss:  0.73 
#> OOB spectral loss:  0.11 

# comparison of variable importance
imp_ranger <- fit_ranger$variable.importance
imp_sdf <- fit$var_importance
imp_col <- rep('black', length(imp_ranger))
imp_col[sim_data$j] <- 'red'

plot(imp_ranger, imp_sdf, col = imp_col, pch = 20,
     xlab = 'ranger', ylab = 'SDForest', 
     main = 'Variable Importance')


# check regularization path of variable importance
path <- regPath(fit)
# out of bag error for different regularization
plotOOB(path)

plot(path)


# detection of causal parent using stability selection
stablePath <- stabilitySelection(fit)
plot(stablePath)


# pruning of forest according to optimal out-of-bag performance
fit <- prune(fit, cp = path$cp_min)

# partial functional dependence of y on the most important covariate
most_imp <- which.max(fit$var_importance)
dep <- partDependence(fit, most_imp)
plot(dep, n_examples = 100)

# }