Estimates the optimal complexity parameter for the SDTree using cross-validation. The transformations are estimated for each training set and validation set separately to ensure independence of the validation set.

cvSDTree(
  formula = NULL,
  data = NULL,
  x = NULL,
  y = NULL,
  max_leaves = NULL,
  cp = 0,
  min_sample = 5,
  mtry = NULL,
  fast = TRUE,
  Q_type = "trim",
  trim_quantile = 0.5,
  q_hat = 0,
  Qf = NULL,
  A = NULL,
  gamma = 0.5,
  gpu = FALSE,
  mem_size = 1e+07,
  max_candidates = 100,
  nfolds = 3,
  cp_seq = NULL,
  mc.cores = 1,
  Q_scale = TRUE
)

Arguments

formula

Object of class formula describing the model to fit, of the form y ~ x1 + x2 + ..., where y is a numeric response and x1, x2, ... are vectors of covariates. Interactions are not supported.

data

Training data of class data.frame containing the variables in the model.

x

Predictor data, alternative to formula and data.

y

Response vector, alternative to formula and data.

max_leaves

Maximum number of leaves for the grown tree.

cp

Complexity parameter, minimum loss decrease to split a node. A split is only performed if the loss decrease is larger than cp * initial_loss, where initial_loss is the loss of the initial estimate using only a stump.
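The rule above can be sketched in a few lines of base R. This is only an illustration: should_split is a hypothetical helper, not a function of the package, and the loss values are made up.

```r
# Hypothetical helper illustrating the stopping rule described above.
# It is not part of the package.
should_split <- function(loss_decrease, initial_loss, cp) {
  # Split only if the decrease exceeds the cp-scaled loss of the stump
  loss_decrease > cp * initial_loss
}

initial_loss <- 40  # made-up loss of the initial stump

# With cp = 0.1 the threshold is 0.1 * 40 = 4
accept <- should_split(5, initial_loss, cp = 0.1)  # 5 > 4, split is performed
reject <- should_split(3, initial_loss, cp = 0.1)  # 3 > 4 fails, no split
print(c(accept = accept, reject = reject))
```

Because the threshold is relative to the loss of the stump, the same cp value has a comparable meaning across responses on different scales.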

min_sample

Minimum number of observations per leaf. A split is only performed if both resulting leaves have at least min_sample observations.

mtry

Number of randomly selected covariates to consider for a split. If NULL, all covariates are available for each split.

fast

If TRUE, only the optimal splits in the new leaves are evaluated, and the previously optimal splits and their potential loss decrease are reused. If FALSE, all possible splits in all leaves are reevaluated after every split.

Q_type

Type of deconfounding, one of 'trim', 'pca', 'no_deconfounding'. 'trim' corresponds to the Trim transform (Ćevid et al., 2020) as implemented in the Doubly Debiased Lasso (Guo et al., 2022), 'pca' to the PCA transformation (Paul et al., 2008). See get_Q.

trim_quantile

Quantile for Trim transform, only needed for trim and DDL_trim, see get_Q.

q_hat

Assumed confounding dimension, only needed for pca, see get_Q.

Qf

Spectral transformation. If NULL, it is estimated internally using get_Q.

A

Numerical Anchor of class matrix. See get_W.

gamma

Strength of distributional robustness, \(\gamma \in [0, \infty]\). See get_W.

gpu

If TRUE, the calculations are performed on the GPU, provided it is properly set up.

mem_size

Number of split candidates that can be evaluated at once. This is a trade-off between memory and speed; it can be decreased if either the memory is not sufficient or the GPU is too small.

max_candidates

Maximum number of split points that are proposed at each node for each covariate.

nfolds

Number of folds for cross-validation. It is recommended not to use more than 5 folds if the number of covariates is larger than the number of observations. In this case the spectral transformation could differ too much if the validation data is substantially smaller than the training data.
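To see how nfolds changes the size of the validation sets, the following base R sketch assigns observations to folds at random. How cvSDTree forms its folds internally is not stated here, so the assignment scheme and the helper fold_sizes are assumptions used only for illustration.

```r
# Hypothetical helper: random, nearly balanced fold assignment.
# This is an assumed scheme, not necessarily the one used by cvSDTree.
fold_sizes <- function(n, nfolds) {
  fold_id <- sample(rep(seq_len(nfolds), length.out = n))
  as.vector(table(fold_id))
}

set.seed(1)
n <- 50

sizes_3 <- fold_sizes(n, nfolds = 3)    # validation sets of 16 or 17 observations
sizes_10 <- fold_sizes(n, nfolds = 10)  # validation sets of 5 observations

# Each training set contains the observations outside the validation fold
train_3 <- n - sizes_3
train_10 <- n - sizes_10

print(rbind(validation = sizes_3, training = train_3))
print(rbind(validation = sizes_10, training = train_10))
```

With 10 folds each validation set holds only 5 of the 50 observations, so a transformation estimated on it rests on far fewer rows than the one estimated on the 45 training observations.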

cp_seq

Sequence of complexity parameters cp to compare using cross-validation. If NULL, a sequence from 0 to 0.6 with step size 0.002 is used.
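The default grid described above can be reproduced in base R, which also shows how a coarser grid could be built for cp_seq. The name cp_coarse and its step size are illustrative choices, not package defaults.

```r
# Grid matching the documented default: from 0 to 0.6 in steps of 0.002
cp_default <- seq(0, 0.6, by = 0.002)
print(length(cp_default))

# A coarser, hypothetical grid that could be passed as cp_seq instead
cp_coarse <- seq(0, 0.6, by = 0.02)
print(length(cp_coarse))
```

A coarser grid means fewer candidate values to compare, at the price of a less finely resolved choice of cp.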

mc.cores

Number of cores to use for parallel computation.

Q_scale

Should data be scaled to estimate the spectral transformation? Default is TRUE to not reduce the signal of high variance covariates, and we do not know of a scenario where this hurts.

Value

A list containing

cp_min

The optimal complexity parameter.

cp_table

A table containing the complexity parameter and the mean and standard deviation of the loss on the validation sets for each complexity parameter. If multiple complexity parameters result in the same loss, only the one with the largest complexity parameter is shown.
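The relation between cp_table and cp_min can be checked by hand. The sketch below rebuilds the table from the values printed in the Examples section of this page and assumes that cp_min is the row with the smallest mean validation loss, which is consistent with that printed output but is not stated explicitly here.

```r
# Values copied from the example output on this page
cp_table <- cbind(
  cp = c(0.044, 0.056, 0.058, 0.070, 0.076, 0.106,
         0.132, 0.262, 0.326, 0.362, 0.600),
  "SDLoss mean" = c(30.89977, 30.65170, 30.60373, 31.28713, 29.12192,
                    31.64469, 32.25506, 30.93818, 31.50641, 33.27446,
                    34.64608),
  "SDLoss sd" = c(4.336556, 5.398475, 5.370633, 5.866639, 5.155334,
                  3.284572, 3.564472, 5.837083, 6.333261, 3.286796,
                  4.718930)
)

# Assumption: the selected cp minimizes the mean validation loss
best_row <- which.min(cp_table[, "SDLoss mean"])
cp_selected <- unname(cp_table[best_row, "cp"])
print(cp_selected)
```

The standard deviations are large relative to the differences between the means, so neighbouring values of cp perform similarly in this small example.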

References

Author

Markus Ulmer

Examples

set.seed(1)
n <- 50
X <- matrix(rnorm(n * 5), nrow = n)
y <- sign(X[, 1]) * 3 + rnorm(n, 0, 5)
cp <- cvSDTree(x = X, y = y, Q_type = 'no_deconfounding')
cp
#> $cp_min
#>    cp 
#> 0.076 
#> 
#> $cp_table
#>          cp SDLoss mean SDLoss sd
#>  [1,] 0.044    30.89977  4.336556
#>  [2,] 0.056    30.65170  5.398475
#>  [3,] 0.058    30.60373  5.370633
#>  [4,] 0.070    31.28713  5.866639
#>  [5,] 0.076    29.12192  5.155334
#>  [6,] 0.106    31.64469  3.284572
#>  [7,] 0.132    32.25506  3.564472
#>  [8,] 0.262    30.93818  5.837083
#>  [9,] 0.326    31.50641  6.333261
#> [10,] 0.362    33.27446  3.286796
#> [11,] 0.600    34.64608  4.718930
#>
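A typical next step is to refit a single tree on all data with the selected complexity parameter. The sketch below assumes that SDTree() from the same package accepts the x, y, cp, and Q_type arguments in the same way as cvSDTree(). The requireNamespace guard is only there so that the snippet does nothing when the package is not installed.

```r
# Runs only if the SDModels package is available
if (requireNamespace("SDModels", quietly = TRUE)) {
  library(SDModels)

  set.seed(1)
  n <- 50
  X <- matrix(rnorm(n * 5), nrow = n)
  y <- sign(X[, 1]) * 3 + rnorm(n, 0, 5)

  # Choose cp by cross-validation, as in the example above
  cv <- cvSDTree(x = X, y = y, Q_type = "no_deconfounding")

  # Assumption: SDTree() takes cp and Q_type like cvSDTree() does
  fit <- SDTree(x = X, y = y, cp = cv$cp_min, Q_type = "no_deconfounding")
  print(fit)
}
```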